Machine learning is one of the fastest growing areas of computer science, 
with far-reaching applications. The aim of this textbook is to introduce 
machine learning, and the algorithmic paradigms it offers, in a princi- 
pled way. The book provides an extensive theoretical account of the 
fundamental ideas underlying machine learning and the mathematical 
derivations that transform these principles into practical algorithms. Fol- 
lowing a presentation of the basics of the field, the book covers a wide 
array of central topics that have not been addressed by previous text- 
books. These include a discussion of the computational complexity of 
learning and the concepts of convexity and stability; important algorith- 
mic paradigms including stochastic gradient descent, neural networks, 
and structured output learning; and emerging theoretical concepts such as 
the PAC-Bayes approach and compression-based bounds. Designed for 
an advanced undergraduate or beginning graduate course, the text makes 
the fundamentals and algorithms of machine learning accessible to stu- 
dents and nonexpert readers in statistics, computer science, mathematics, 
and engineering. 
Preface 


The term machine learning refers to the automated detection of meaningful patterns 
in data. In the past couple of decades it has become a common tool in almost any 
task that requires information extraction from large data sets. We are surrounded 
by machine learning based technology: Search engines learn how to bring us the 
best results (while placing profitable ads), antispam software learns to filter our 
e-mail messages, and credit card transactions are secured by software that learns 
how to detect fraud. Digital cameras learn to detect faces and intelligent personal 
assistance applications on smartphones learn to recognize voice commands. Cars 
are equipped with accident prevention systems that are built using machine learning 
algorithms. Machine learning is also widely used in scientific applications such as 
bioinformatics, medicine, and astronomy. 

One common feature of all of these applications is that, in contrast to more tra- 
ditional uses of computers, in these cases, due to the complexity of the patterns that 
need to be detected, a human programmer cannot provide an explicit, fine-detailed 
specification of how such tasks should be executed. Consider intelligent 
beings: many of our skills are acquired or refined through learning from our 
experience (rather than by following explicit instructions given to us). Machine learning tools 
are concerned with endowing programs with the ability to “learn” and adapt. 

The first goal of this book is to provide a rigorous, yet easy to follow, introduction 
to the main concepts underlying machine learning: What is learning? How can a 
machine learn? How do we quantify the resources needed to learn a given concept? 
Is learning always possible? Can we know whether the learning process succeeded or 
failed? 

The second goal of this book is to present several key machine learning algo- 
rithms. We chose to present algorithms that on one hand are successfully used in 
practice and on the other hand give a wide spectrum of different learning tech- 
niques. Additionally, we pay specific attention to algorithms appropriate for large 
scale learning (a.k.a. “Big Data”), since in recent years, our world has become 
increasingly “digitized” and the amount of data available for learning is dramati- 
cally increasing. As a result, in many applications data is plentiful and computation 
time is the main bottleneck. We therefore explicitly quantify both the amount of 
data and the amount of computation time needed to learn a given concept. 

The book is divided into four parts. The first part aims at giving an initial rigor- 
ous answer to the fundamental questions of learning. We describe a generalization 
of Valiant’s Probably Approximately Correct (PAC) learning model, which is a first 
solid answer to the question “What is learning?” We describe the Empirical Risk 
Minimization (ERM), Structural Risk Minimization (SRM), and Minimum Descrip- 
tion Length (MDL) learning rules, which show “how a machine can learn.” We 
quantify the amount of data needed for learning using the ERM, SRM, and MDL 
rules and show how learning might fail by deriving a “no-free-lunch” theorem. We 
also discuss how much computation time is required for learning. In the second part 
of the book we describe various learning algorithms. For some of the algorithms, 
we first present a more general learning principle, and then show how the algorithm 
follows the principle. While the first two parts of the book focus on the PAC model, 
the third part extends the scope by presenting a wider variety of learning models. 
Finally, the last part of the book is devoted to advanced theory. 

We made an attempt to keep the book as self-contained as possible. However, 
the reader is assumed to be comfortable with basic notions of probability, linear 
algebra, analysis, and algorithms. The first three parts of the book are intended 
for first year graduate students in computer science, engineering, mathematics, or 
statistics. They can also be accessible to undergraduate students with adequate 
background. The more advanced chapters can be used by researchers intending to 
gather a deeper theoretical understanding. 
Introduction 


The subject of this book is automated learning, or, as we will more often call it, 
Machine Learning (ML). That is, we wish to program computers so that they can 
“learn” from input available to them. Roughly speaking, learning is the process of 
converting experience into expertise or knowledge. The input to a learning algo- 
rithm is training data, representing experience, and the output is some expertise, 
which usually takes the form of another computer program that can perform some 
task. Seeking a formal-mathematical understanding of this concept, we’ll have to 
be more explicit about what we mean by each of the involved terms: What is the 
training data our programs will access? How can the process of learning be auto- 
mated? How can we evaluate the success of such a process (namely, the quality of 
the output of a learning program)? 


1.1 WHAT IS LEARNING? 


Let us begin by considering a couple of examples from naturally occurring animal 
learning. Some of the most fundamental issues in ML arise already in that context, 
which we are all familiar with. 

Bait Shyness — Rats Learning to Avoid Poisonous Baits: When rats encounter 
food items with novel look or smell, they will first eat very small amounts, and sub- 
sequent feeding will depend on the flavor of the food and its physiological effect. 
If the food produces an ill effect, the novel food will often be associated with the 
illness, and subsequently, the rats will not eat it. Clearly, there is a learning mech- 
anism in play here — the animal used past experience with some food to acquire 
expertise in detecting the safety of this food. If past experience with the food was 
negatively labeled, the animal predicts that it will also have a negative effect when 
encountered in the future. 

Inspired by the preceding example of successful learning, let us demonstrate 
a typical machine learning task. Suppose we would like to program a machine that 
learns how to filter spam e-mails. A naive solution would be seemingly similar to the 
way rats learn how to avoid poisonous baits. The machine will simply memorize all 
previous e-mails that had been labeled as spam e-mails by the human user. When a 
new e-mail arrives, the machine will search for it in the set of previous spam e-mails. 
If it matches one of them, it will be trashed. Otherwise, it will be moved to the user’s 
inbox folder. 

While the preceding “learning by memorization” approach is sometimes useful, 
it lacks an important aspect of learning systems — the ability to label unseen e-mail 
messages. A successful learner should be able to progress from individual examples 
to broader generalization. This is also referred to as inductive reasoning or inductive 
inference. In the bait shyness example presented previously, after the rats encounter 
an example of a certain type of food, they apply their attitude toward it on new, 
unseen examples of food of similar smell and taste. To achieve generalization in the 
spam filtering task, the learner can scan the previously seen e-mails, and extract a set 
of words whose appearance in an e-mail message is indicative of spam. Then, when 
a new e-mail arrives, the machine can check whether one of the suspicious words 
appears in it, and predict its label accordingly. Such a system would potentially be 
able to predict correctly the label of unseen e-mails. 
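
The contrast between "learning by memorization" and generalizing from suspicious words can be made concrete with a toy sketch. Everything below is an illustrative assumption rather than a method from the text: the tiny training set is hypothetical, and a word is treated as "suspicious" if it occurs in some spam e-mail and in no non-spam e-mail, which is only one of many possible rules.

```python
# Toy sketch: memorization versus a word-based spam rule.
# All data and function names are hypothetical illustrations.

def train_memorizer(labeled_emails):
    """Return the set of e-mails that the user labeled as spam."""
    return {text for text, is_spam in labeled_emails if is_spam}

def predict_memorizer(spam_seen, email):
    """Flag an e-mail only if it exactly matches a memorized spam e-mail."""
    return email in spam_seen

def train_word_filter(labeled_emails):
    """Collect words seen in spam e-mails but in no non-spam e-mail (assumed rule)."""
    spam_words, ham_words = set(), set()
    for text, is_spam in labeled_emails:
        (spam_words if is_spam else ham_words).update(text.lower().split())
    return spam_words - ham_words

def predict_word_filter(suspicious, email):
    """Flag an e-mail if it contains at least one suspicious word."""
    return any(word in suspicious for word in email.lower().split())

training = [
    ("win a free prize now", True),
    ("cheap pills free shipping", True),
    ("meeting moved to monday", False),
    ("lunch now or later", False),
]
spam_seen = train_memorizer(training)
suspicious = train_word_filter(training)

unseen = "claim your free prize"
print(predict_memorizer(spam_seen, unseen))     # False: never seen verbatim
print(predict_word_filter(suspicious, unseen))  # True: contains "free" and "prize"
```

The memorizer can only repeat labels it has already seen, whereas the word-based rule assigns a label to an e-mail that never appeared in training. Note that the toy rule also marks the word "a" as suspicious, a small instance of the false conclusions that inductive reasoning can produce.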

However, inductive reasoning might lead us to false conclusions. To illustrate 
this, let us consider again an example from animal learning. 

Pigeon Superstition: In an experiment, the psychologist B. F. Skinner placed 
a group of hungry pigeons in a cage. An automatic mechanism 
had been attached to the cage, delivering food to the pigeons at regular 
intervals with no reference whatsoever to the birds’ behavior. The hungry pigeons 
went around the cage, and when food was first delivered, it found each pigeon 
engaged in some activity (pecking, turning the head, etc.). The arrival of food rein- 
forced each bird’s specific action, and consequently, each bird tended to spend some 
more time doing that very same action. That, in turn, increased the chance that the 
next random food delivery would find each bird engaged in that activity again. What 
results is a chain of events that reinforces the pigeons’ association of the delivery of 
the food with whatever chance actions they had been performing when it was first 
delivered. They subsequently continue to perform these same actions diligently.¹ 

What distinguishes learning mechanisms that result in superstition from useful 
learning? This question is crucial to the development of automated learners. While 
human learners can rely on common sense to filter out random meaningless learning 
conclusions, once we export the task of learning to a machine, we must provide 
well defined crisp principles that will protect the program from reaching senseless 
or useless conclusions. The development of such principles is a central goal of the 
theory of machine learning. 

What, then, made the rats’ learning more successful than that of the pigeons? 
As a first step toward answering this question, let us have a closer look at the bait 
shyness phenomenon in rats. 

Bait Shyness revisited — rats fail to acquire conditioning between food and electric 
shock or between sound and nausea: The bait shyness mechanism in rats turns out to 
be more complex than what one may expect. In experiments carried out by Garcia 
(Garcia & Koelling 1966), it was demonstrated that if the unpleasant stimulus that 
follows food consumption is replaced by, say, electrical shock (rather than nausea), 
then no conditioning occurs. Even after repeated trials in which the consumption 
of some food is followed by the administration of unpleasant electrical shock, the 
rats do not tend to avoid that food. Similar failure of conditioning occurs when the 
characteristic of the food that implies nausea (such as taste or smell) is replaced 
by a vocal signal. The rats seem to have some “built in” prior knowledge telling 
them that, while temporal correlation between food and nausea can be causal, it is 
unlikely that there would be a causal relationship between food consumption and 
electrical shocks or between sounds and nausea. 

¹ See: http://psychclassics.yorku.ca/Skinner/Pigeon 

We conclude that one distinguishing feature between the bait shyness learn- 
ing and the pigeon superstition is the incorporation of prior knowledge that biases 
the learning mechanism. This is also referred to as inductive bias. The pigeons in 
the experiment are willing to adopt any explanation for the occurrence of food. 
However, the rats “know” that food cannot cause an electric shock and that the 
co-occurrence of noise with some food is not likely to affect the nutritional value 
of that food. The rats’ learning process is biased toward detecting some kind of 
patterns while ignoring other temporal correlations between events. 

It turns out that the incorporation of prior knowledge, biasing the learning pro- 
cess, is inevitable for the success of learning algorithms (this is formally stated and 
proved as the “No-Free-Lunch theorem” in Chapter 5). The development of tools 
for expressing domain expertise, translating it into a learning bias, and quantifying 
the effect of such a bias on the success of learning is a central theme of the theory 
of machine learning. Roughly speaking, the stronger the prior knowledge (or prior 
assumptions) that one starts the learning process with, the easier it is to learn from 
further examples. However, the stronger these prior assumptions are, the less flex- 
ible the learning is — it is bound, a priori, by the commitment to these assumptions. 
We shall discuss these issues explicitly in Chapter 5. 


1.2 WHEN DO WE NEED MACHINE LEARNING? 


When do we need machine learning rather than directly programming our computers to 
carry out the task at hand? Two aspects of a given problem may call for the use of 
programs that learn and improve on the basis of their “experience”: the problem’s 
complexity and the need for adaptivity. 


Tasks That Are Too Complex to Program. 


• Tasks Performed by Animals/Humans: There are numerous tasks that we 
human beings perform routinely, yet our introspection concerning how 
we do them is not sufficiently elaborate to extract a well defined program. 
Examples of such tasks include driving, speech recognition, and 
image understanding. In all of these tasks, state of the art machine learning 
programs, programs that “learn from their experience,” achieve quite 
satisfactory results, once exposed to sufficiently many training examples. 

• Tasks beyond Human Capabilities: Another wide family of tasks that benefit 
from machine learning techniques is related to the analysis of very 
large and complex data sets: astronomical data, turning medical archives 
into medical knowledge, weather prediction, analysis of genomic data, Web 
search engines, and electronic commerce. With more and more available 
digitally recorded data, it becomes obvious that there are treasures of meaningful 
information buried in data archives that are way too large and too 
complex for humans to make sense of. Learning to detect meaningful patterns 
in large and complex data sets is a promising domain in which the 
combination of programs that learn with the almost unlimited memory 
capacity and ever increasing processing speed of computers opens up new 
horizons. 


Adaptivity. One limiting feature of programmed tools is their rigidity — once the 
program has been written down and installed, it stays unchanged. However, 
many tasks change over time or from one user to another. Machine learning 
tools — programs whose behavior adapts to their input data — offer a solution to 
such issues; they are, by nature, adaptive to changes in the environment they 
interact with. Typical successful applications of machine learning to such prob- 
lems include programs that decode handwritten text, where a fixed program can 
adapt to variations between the handwriting of different users; spam detection 
programs, adapting automatically to changes in the nature of spam e-mails; and 
speech recognition programs. 


1.3 TYPES OF LEARNING 


Learning is, of course, a very wide domain. Consequently, the field of machine 
learning has branched into several subfields dealing with different types of learning 
tasks. We give a rough taxonomy of learning paradigms, aiming to provide some 
perspective of where the content of this book sits within the wide field of machine 
learning. 

We describe four parameters along which learning paradigms can be classified. 


Supervised versus Unsupervised. Since learning involves an interaction between the 
learner and the environment, one can divide learning tasks according to the 
nature of that interaction. The first distinction to note is the difference between 
supervised and unsupervised learning. As an illustrative example, consider the 
task of learning to detect spam e-mail versus the task of anomaly detection. 
For the spam detection task, we consider a setting in which the learner receives 
training e-mails for which the label spam/not-spam is provided. On the basis of 
such training the learner should figure out a rule for labeling a newly arriving 
e-mail message. In contrast, for the task of anomaly detection, all the learner 
gets as training is a large body of e-mail messages (with no labels) and the 
learner’s task is to detect “unusual” messages. 

More abstractly, viewing learning as a process of “using experience to gain 
expertise,” supervised learning describes a scenario in which the “experience,” 
a training example, contains significant information (say, the spam/not-spam 
labels) that is missing in the unseen “test examples” to which the learned exper- 
tise is to be applied. In this setting, the acquired expertise is aimed to predict 
that missing information for the test data. In such cases, we can think of the 
environment as a teacher that “supervises” the learner by providing the extra 
information (labels). In unsupervised learning, however, there is no distinction 
between training and test data. The learner processes input data with the goal 
of coming up with some summary, or compressed version of that data. Clustering 
a data set into subsets of similar objects is a typical example of such a 
task. 

There is also an intermediate learning setting in which, while the train- 
ing examples contain more information than the test examples, the learner is 
required to predict even more information for the test examples. For exam- 
ple, one may try to learn a value function that describes for each setting of a 
chess board the degree by which White’s position is better than Black’s. 
Yet, the only information available to the learner at training time is positions 
that occurred throughout actual chess games, labeled by who eventually won 
that game. Such learning frameworks are mainly investigated under the title of 
reinforcement learning. 

Active versus Passive Learners. Learning paradigms can vary by the role played 
by the learner. We distinguish between “active” and “passive” learners. An 
active learner interacts with the environment at training time, say, by posing 
queries or performing experiments, while a passive learner only observes the 
information provided by the environment (or the teacher) without influenc- 
ing or directing it. Note that the learner of a spam filter is usually passive 
— waiting for users to mark the e-mails coming to them. In an active set- 
ting, one could imagine asking users to label specific e-mails chosen by the 
learner, or even composed by the learner, to enhance its understanding of what 
spam is. 

Helpfulness of the Teacher. When one thinks about human learning, of a baby at 
home or a student at school, the process often involves a helpful teacher, who 
is trying to feed the learner with the information most useful for achieving 
the learning goal. In contrast, when a scientist learns about nature, the envir- 
onment, playing the role of the teacher, can be best thought of as passive — 
apples drop, stars shine, and the rain falls without regard to the needs of the 
learner. We model such learning scenarios by postulating that the training data 
(or the learner’s experience) is generated by some random process. This is the 
basic building block in the branch of “statistical learning.” Finally, learning also 
occurs when the learner’s input is generated by an adversarial “teacher.” This 
may be the case in the spam filtering example (if the spammer makes an effort 
to mislead the spam filtering designer) or in learning to detect fraud. One also 
uses an adversarial teacher model as a worst-case scenario, when no milder 
setup can be safely assumed. If you can learn against an adversarial teacher, 
you are guaranteed to succeed interacting with any odd teacher. 

Online versus Batch Learning Protocol. The last parameter we mention is the 
distinction between situations in which the learner has to respond online, 
throughout the learning process, and settings in which the learner has to engage the 
acquired expertise only after having a chance to process large amounts of data. 
For example, a stockbroker has to make daily decisions, based on the expe- 
rience collected so far. He may become an expert over time, but might have 
made costly mistakes in the process. In contrast, in many data mining settings, 
the learner — the data miner — has large amounts of training data to play with 
before having to output conclusions. 
In this book we shall discuss only a subset of the possible learning paradigms. 
Our main focus is on supervised statistical batch learning with a passive learner 
(for example, trying to learn how to generate patients’ prognoses, based on large 
archives of records of patients that were independently collected and are already 
labeled by the fate of the recorded patients). We shall also briefly discuss online 
learning and batch unsupervised learning (in particular, clustering). 


1.4 RELATIONS TO OTHER FIELDS 


As an interdisciplinary field, machine learning shares common threads with the 
mathematical fields of statistics, information theory, game theory, and optimization. 
It is naturally a subfield of computer science, as our goal is to program machines so 
that they will learn. In a sense, machine learning can be viewed as a branch of AI 
(Artificial Intelligence), since, after all, the ability to turn experience into exper- 
tise or to detect meaningful patterns in complex sensory data is a cornerstone of 
human (and animal) intelligence. However, one should note that, in contrast with 
traditional AI, machine learning is not trying to build automated imitation of intel- 
ligent behavior, but rather to use the strengths and special abilities of computers 
to complement human intelligence, often performing tasks that fall way beyond 
human capabilities. For example, the ability to scan and process huge databases 
allows machine learning programs to detect patterns that are outside the scope of 
human perception. 

The component of experience, or training, in machine learning often refers to 
data that is randomly generated. The task of the learner is to process such randomly 
generated examples toward drawing conclusions that hold for the environment from 
which these examples are picked. This description of machine learning highlights its 
close relationship with statistics. Indeed there is a lot in common between the two 
disciplines, in terms of both the goals and techniques used. There are, however, a 
few significant differences of emphasis; if a doctor comes up with the hypothesis 
that there is a correlation between smoking and heart disease, it is the statistician’s 
role to view samples of patients and check the validity of that hypothesis (this is the 
common statistical task of hypothesis testing). In contrast, machine learning aims 
to use the data gathered from samples of patients to come up with a description of 
the causes of heart disease. The hope is that automated techniques may be able to 
figure out meaningful patterns (or hypotheses) that may have been missed by the 
human observer. 

In contrast with traditional statistics, in machine learning in general, and in this 
book in particular, algorithmic considerations play a major role. Machine learning 
is about the execution of learning by computers; hence algorithmic issues are piv- 
otal. We develop algorithms to perform the learning tasks and are concerned with 
their computational efficiency. Another difference is that while statistics is often 
interested in asymptotic behavior (like the convergence of sample-based statisti- 
cal estimates as the sample sizes grow to infinity), the theory of machine learning 
focuses on finite sample bounds. Namely, given the size of available samples, 
machine learning theory aims to figure out the degree of accuracy that a learner 
can expect on the basis of such samples. 
There are further differences between these two disciplines, of which we shall 
mention only one more here. While in statistics it is common to work under the 
assumption of certain prescribed data models (such as assuming the normality 
of data-generating distributions, or the linearity of functional dependencies), in 
machine learning the emphasis is on working under a “distribution-free” setting, 
where the learner assumes as little as possible about the nature of the data distribu- 
tion and allows the learning algorithm to figure out which models best approximate 
the data-generating process. A precise discussion of this issue requires some techni- 
cal preliminaries, and we will come back to it later in the book, and in particular in 
Chapter 5. 


1.5 HOW TO READ THIS BOOK 


The first part of the book provides the basic theoretical principles that underlie 
machine learning (ML). In a sense, this is the foundation upon which the rest of 
the book is built. This part could serve as a basis for a minicourse on the theoretical 
foundations of ML. 

The second part of the book introduces the most commonly used algorithmic 
approaches to supervised machine learning. A subset of these chapters may also be 
used for introducing machine learning in a general AI course to computer science, 
mathematics, or engineering students. 

The third part of the book extends the scope of discussion from statistical clas- 
sification to other learning models. It covers online learning, unsupervised learning, 
dimensionality reduction, generative models, and feature learning. 

The fourth part of the book, Advanced Theory, is geared toward readers who 
have interest in research and provides the more technical mathematical techniques 
that serve to analyze and drive forward the field of theoretical machine learning. 

The Appendixes provide some technical tools used in the book. In particular, we 
list basic results from measure concentration and linear algebra. 

A few sections are marked by an asterisk, which means they are addressed 
to more advanced students. Each chapter is concluded with a list of exercises. A 
solution manual is provided on the course Web site. 




1.5.1 Possible Course Plans Based on This Book 

A 14 Week Introduction Course for Graduate Students: 

1. Chapters 2–4. 
2. Chapter 9 (without the VC calculation). 
3. Chapters 5–6 (without proofs). 
4. Chapter 10. 
5. Chapters 7, 11 (without proofs). 
6. Chapters 12, 13 (with some of the easier proofs). 
7. Chapter 14 (with some of the easier proofs). 
8. Chapter 15. 
9. Chapter 16. 
10. Chapter 18. 
11. Chapter 22. 
12. Chapter 23 (without proofs for compressed sensing). 
13. Chapter 24. 
14. Chapter 25. 


A 14 Week Advanced Course for Graduate Students: 

1. Chapters 26, 27. 
2. (continued) 
3. Chapters 6, 28. 
4. Chapter 7. 
5. Chapter 31. 
6. Chapter 30. 
7. Chapters 12, 13. 
8. Chapter 14. 
9. Chapter 8. 
10. Chapter 17. 
11. Chapter 29. 
12. Chapter 19. 
13. Chapter 20. 
14. Chapter 21. 


1.6 NOTATION 


Most of the notation we use throughout the book is either standard or defined on 
the spot. In this section we describe our main conventions and provide a table sum- 
marizing our notation (Table 1.1). The reader is encouraged to skip this section and 
return to it if during the reading of the book some notation is unclear. 

We denote scalars and abstract objects with lowercase letters (e.g. $x$ and $\lambda$). 
Often, we would like to emphasize that some object is a vector and then we use 
boldface letters (e.g. $\mathbf{x}$ and $\boldsymbol{\lambda}$). The $i$th element of a vector $\mathbf{x}$ is denoted by $x_i$. We use 
uppercase letters to denote matrices, sets, and sequences. The meaning should be 
clear from the context. As we will see momentarily, the input of a learning algorithm 
is a sequence of training examples. We denote by $z$ an abstract example and by 
$S = z_1, \ldots, z_m$ a sequence of $m$ examples. Historically, $S$ is often referred to as a 
training set; however, we will always assume that $S$ is a sequence rather than a set. 
A sequence of $m$ vectors is denoted by $\mathbf{x}_1, \ldots, \mathbf{x}_m$. The $i$th element of $\mathbf{x}_t$ is denoted 
by $x_{t,i}$. 

Throughout the book, we make use of basic notions from probability. We denote 
by $\mathcal{D}$ a distribution over some set,² for example, $Z$. We use the notation $z \sim \mathcal{D}$ to 
denote that $z$ is sampled according to $\mathcal{D}$. Given a random variable $f : Z \to \mathbb{R}$, its 
expected value is denoted by $\mathbb{E}_{z \sim \mathcal{D}}[f(z)]$. We sometimes use the shorthand $\mathbb{E}[f]$ 
when the dependence on $z$ is clear from the context. For $f : Z \to \{\text{true}, \text{false}\}$ we 
also use $\mathbb{P}_{z \sim \mathcal{D}}[f(z)]$ to denote $\mathcal{D}(\{z : f(z) = \text{true}\})$. In the next chapter we will also 


² To be mathematically precise, $\mathcal{D}$ should be defined over some $\sigma$-algebra of subsets of $Z$. The user who 
is not familiar with measure theory can skip the few footnotes and remarks regarding more formal 
measurability definitions and assumptions. 
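Table 1.1. Summary of notation 

symbol                                            meaning 

$\mathbb{R}$                                      the set of real numbers 
$\mathbb{R}^d$                                    the set of $d$-dimensional vectors over $\mathbb{R}$ 
$\mathbb{R}_+$                                    the set of non-negative real numbers 
$\mathbb{N}$                                      the set of natural numbers 
$O, o, \Theta, \omega, \Omega, \tilde{O}$         asymptotic notation (see text) 
$\mathbb{1}_{[\text{Boolean expression}]}$        indicator function (equals 1 if expression is true and 0 o.w.) 
$[a]_+$                                           $= \max\{0, a\}$ 
$[n]$                                             the set $\{1, \ldots, n\}$ (for $n \in \mathbb{N}$) 
$\mathbf{x}, \mathbf{v}, \mathbf{w}$              (column) vectors 
$x_i, v_i, w_i$                                   the $i$th element of a vector 
$\langle \mathbf{x}, \mathbf{v} \rangle$          $= \sum_{i=1}^d x_i v_i$ (inner product) 
$\|\mathbf{x}\|_2$ or $\|\mathbf{x}\|$            $= \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle}$ (the $\ell_2$ norm of $\mathbf{x}$) 
$\|\mathbf{x}\|_1$                                $= \sum_{i=1}^d |x_i|$ (the $\ell_1$ norm of $\mathbf{x}$) 
$\|\mathbf{x}\|_\infty$                           $= \max_i |x_i|$ (the $\ell_\infty$ norm of $\mathbf{x}$) 
$\|\mathbf{x}\|_0$                                the number of nonzero elements of $\mathbf{x}$ 
$A \in \mathbb{R}^{d,k}$                          a $d \times k$ matrix over $\mathbb{R}$ 
$A^\top$                                          the transpose of $A$ 
$A_{i,j}$                                         the $(i, j)$ element of $A$ 
$\mathbf{x}\mathbf{x}^\top$                       the $d \times d$ matrix $A$ s.t. $A_{i,j} = x_i x_j$ (where $\mathbf{x} \in \mathbb{R}^d$) 
$\mathbf{x}_1, \ldots, \mathbf{x}_m$              a sequence of $m$ vectors 
$x_{i,j}$                                         the $j$th element of the $i$th vector in the sequence 
$\mathbf{w}^{(1)}, \ldots, \mathbf{w}^{(T)}$      the values of a vector $\mathbf{w}$ during an iterative algorithm 
$w_i^{(t)}$                                       the $i$th element of the vector $\mathbf{w}^{(t)}$ 
$\mathcal{X}$                                     instances domain (a set) 
$\mathcal{Y}$                                     labels domain (a set) 
$Z$                                               examples domain (a set) 
$\mathcal{H}$                                     hypothesis class (a set) 
$\ell : \mathcal{H} \times Z \to \mathbb{R}_+$    loss function 
$\mathcal{D}$                                     a distribution over some set (usually over $Z$ or over $\mathcal{X}$) 
$\mathcal{D}(A)$                                  the probability of a set $A \subseteq Z$ according to $\mathcal{D}$ 
$z \sim \mathcal{D}$                              sampling $z$ according to $\mathcal{D}$ 
$S = z_1, \ldots, z_m$                            a sequence of $m$ examples 
$S \sim \mathcal{D}^m$                            sampling $S = z_1, \ldots, z_m$ i.i.d. according to $\mathcal{D}$ 
$\mathbb{P}, \mathbb{E}$                          probability and expectation of a random variable 
$\mathbb{P}_{z \sim \mathcal{D}}[f(z)]$           $= \mathcal{D}(\{z : f(z) = \text{true}\})$ for $f : Z \to \{\text{true}, \text{false}\}$ 
$\mathbb{E}_{z \sim \mathcal{D}}[f(z)]$           expectation of the random variable $f : Z \to \mathbb{R}$ 
$N(\boldsymbol{\mu}, C)$                          Gaussian distribution with expectation $\boldsymbol{\mu}$ and covariance $C$ 
$f'(x)$                                           the derivative of a function $f : \mathbb{R} \to \mathbb{R}$ at $x$ 
$f''(x)$                                          the second derivative of a function $f : \mathbb{R} \to \mathbb{R}$ at $x$ 
$\frac{\partial f(\mathbf{w})}{\partial w_i}$     the partial derivative of a function $f : \mathbb{R}^d \to \mathbb{R}$ at $\mathbf{w}$ w.r.t. $w_i$ 
$\nabla f(\mathbf{w})$                            the gradient of a function $f : \mathbb{R}^d \to \mathbb{R}$ at $\mathbf{w}$ 
$\partial f(\mathbf{w})$                          the differential set of a function $f : \mathbb{R}^d \to \mathbb{R}$ at $\mathbf{w}$ 
$\min_{x \in C} f(x)$                             $= \min\{f(x) : x \in C\}$ (minimal value of $f$ over $C$) 
$\max_{x \in C} f(x)$                             $= \max\{f(x) : x \in C\}$ (maximal value of $f$ over $C$) 
$\operatorname{argmin}_{x \in C} f(x)$            the set $\{x \in C : f(x) = \min_{z \in C} f(z)\}$ 
$\operatorname{argmax}_{x \in C} f(x)$            the set $\{x \in C : f(x) = \max_{z \in C} f(z)\}$ 
$\log$                                            the natural logarithm 
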
introduce the notation $\mathcal{D}^m$ to denote the probability over $Z^m$ induced by sampling
$(z_1,\ldots,z_m)$ where each point $z_i$ is sampled from $\mathcal{D}$ independently of the other points.

In general, we have made an effort to avoid asymptotic notation. However, we
occasionally use it to clarify the main results. In particular, given $f : \mathbb{R} \to \mathbb{R}_+$ and
$g : \mathbb{R} \to \mathbb{R}_+$, we write $f = O(g)$ if there exist $x_0, \alpha \in \mathbb{R}_+$ such that for all $x > x_0$ we
have $f(x) \le \alpha g(x)$. We write $f = o(g)$ if for every $\alpha > 0$ there exists $x_0$ such that for
all $x > x_0$ we have $f(x) \le \alpha g(x)$. We write $f = \Omega(g)$ if there exist $x_0, \alpha \in \mathbb{R}_+$ such that
for all $x > x_0$ we have $f(x) \ge \alpha g(x)$. The notation $f = \omega(g)$ is defined analogously.
The notation $f = \Theta(g)$ means that $f = O(g)$ and $g = O(f)$. Finally, the notation
$f = \tilde{O}(g)$ means that there exists $k \in \mathbb{N}$ such that $f(x) = O(g(x) \log^k(g(x)))$.

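The inner product between vectors $\mathbf{x}$ and $\mathbf{w}$ is denoted by $\langle \mathbf{x}, \mathbf{w} \rangle$. Whenever we
do not specify the vector space we assume that it is the d-dimensional Euclidean
space and then $\langle \mathbf{x}, \mathbf{w} \rangle = \sum_{i=1}^d x_i w_i$. The Euclidean (or $\ell_2$) norm of a vector $\mathbf{w}$ is
$\|\mathbf{w}\|_2 = \sqrt{\langle \mathbf{w}, \mathbf{w} \rangle}$. We omit the subscript from the $\ell_2$ norm when it is clear from the
context. We also use other $\ell_p$ norms, $\|\mathbf{w}\|_p = \left(\sum_i |w_i|^p\right)^{1/p}$, and in particular
$\|\mathbf{w}\|_1 = \sum_i |w_i|$ and $\|\mathbf{w}\|_\infty = \max_i |w_i|$.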
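As a small illustration of these definitions, the following Python sketch computes the inner product and a few $\ell_p$ norms for a concrete vector. The function names `inner` and `lp_norm` and the example vector are our own choices for illustration, not notation from the text.

```python
import math

def inner(x, w):
    """Euclidean inner product <x, w> = sum_i x_i * w_i."""
    return sum(xi * wi for xi, wi in zip(x, w))

def lp_norm(w, p):
    """l_p norm (sum_i |w_i|^p)^(1/p); p = math.inf gives max_i |w_i|."""
    if p == math.inf:
        return max(abs(wi) for wi in w)
    return sum(abs(wi) ** p for wi in w) ** (1.0 / p)

w = [3.0, -4.0]                  # hypothetical example vector
l2 = math.sqrt(inner(w, w))      # Euclidean norm via the inner product
l1 = lp_norm(w, 1)               # sum of absolute values
linf = lp_norm(w, math.inf)      # largest absolute value
```

For this vector the three norms are ordered as $\|\mathbf{w}\|_\infty \le \|\mathbf{w}\|_2 \le \|\mathbf{w}\|_1$, which holds for every vector.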

We use the notation $\min_{x \in C} f(x)$ to denote the minimum value of the set
$\{f(x) : x \in C\}$. To be mathematically more precise, we should use $\inf_{x \in C} f(x)$ whenever
the minimum is not achievable. However, in the context of this book the
distinction between infimum and minimum is often of little interest. Hence, to simplify
the presentation, we sometimes use the min notation even when inf is more
adequate. An analogous remark applies to max versus sup.
Let us begin our mathematical analysis by showing how successful learning can be
achieved in a relatively simplified setting. Imagine you have just arrived on a
small Pacific island. You soon find out that papayas are a significant ingredient in the
local diet. However, you have never before tasted papayas. You have to learn how 
to predict whether a papaya you see in the market is tasty or not. First, you need 
to decide which features of a papaya your prediction should be based on. On the 
basis of your previous experience with other fruits, you decide to use two features: 
the papaya’s color, ranging from dark green, through orange and red to dark brown, 
and the papaya’s softness, ranging from rock hard to mushy. Your input for figuring 
out your prediction rule is a sample of papayas that you have examined for color 
and softness and then tasted and found out whether they were tasty or not. Let 
us analyze this task as a demonstration of the considerations involved in learning 
problems. 

Our first step is to describe a formal model aimed to capture such learning tasks. 


2.1 A FORMAL MODEL - THE STATISTICAL LEARNING FRAMEWORK 


The learner’s input: In the basic statistical learning setting, the learner has access 
to the following: 

Domain set: An arbitrary set, $\mathcal{X}$. This is the set of objects that we may wish
to label. For example, in the papaya learning problem mentioned before,
the domain set will be the set of all papayas. Usually, these domain
points will be represented by a vector of features (like the papaya’s color
and softness). We also refer to domain points as instances and to $\mathcal{X}$ as
instance space.

Label set: For our current discussion, we will restrict the label set to be a
two-element set, usually $\{0,1\}$ or $\{-1,+1\}$. Let $\mathcal{Y}$ denote our set of possible
labels. For our papayas example, let $\mathcal{Y}$ be $\{0,1\}$, where 1 represents
being tasty and 0 stands for being not-tasty.

Training data: $S = ((x_1, y_1), \ldots, (x_m, y_m))$ is a finite sequence of pairs in $\mathcal{X} \times \mathcal{Y}$:
that is, a sequence of labeled domain points. This is the input that the
learner has access to (like a set of papayas that have been tasted and their
color, softness, and tastiness). Such labeled examples are often called
training examples. We sometimes also refer to $S$ as a training set.¹

The learner’s output: The learner is requested to output a prediction rule,
$h : \mathcal{X} \to \mathcal{Y}$. This function is also called a predictor, a hypothesis, or a classifier.
The predictor can be used to predict the label of new domain points. In our 
papayas example, it is a rule that our learner will employ to predict whether 
future papayas he examines in the farmers’ market are going to be tasty or not. 
We use the notation A(S) to denote the hypothesis that a learning algorithm, 
A, returns upon receiving the training sequence S. 

A simple data-generation model: We now explain how the training data is generated.
First, we assume that the instances (the papayas we encounter) are
generated by some probability distribution (in this case, representing the
environment). Let us denote that probability distribution over $\mathcal{X}$ by $\mathcal{D}$. It is
important to note that we do not assume that the learner knows anything about
this distribution. For the type of learning tasks we discuss, this could be any
arbitrary probability distribution. As to the labels, in the current discussion
we assume that there is some “correct” labeling function, $f : \mathcal{X} \to \mathcal{Y}$, and that
$y_i = f(x_i)$ for all $i$. This assumption will be relaxed in the next chapter. The
labeling function is unknown to the learner. In fact, this is just what the learner
is trying to figure out. In summary, each pair in the training data $S$ is generated
by first sampling a point $x_i$ according to $\mathcal{D}$ and then labeling it by $f$.
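The data-generation model can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the uniform distribution standing in for $\mathcal{D}$, the particular labeling rule standing in for $f$, and the names `sample_training_set` and `draw_papaya`. In the model itself both $\mathcal{D}$ and $f$ are unknown to the learner, which only ever sees the returned sequence.

```python
import random

def sample_training_set(m, draw_instance, f, rng):
    """Return S = ((x_1, f(x_1)), ..., (x_m, f(x_m))), each x_i drawn from D."""
    sample = []
    for _ in range(m):
        x = draw_instance(rng)      # x_i sampled according to D
        sample.append((x, f(x)))    # y_i = f(x_i)
    return sample

def draw_papaya(rng):
    # Hypothetical D: color and softness each uniform on [0, 1].
    return (rng.random(), rng.random())

def f(x):
    # Hypothetical "correct" labeling: tasty iff both features are mid-range.
    color, softness = x
    return 1 if 0.25 <= color <= 0.75 and 0.25 <= softness <= 0.75 else 0

rng = random.Random(0)
S = sample_training_set(5, draw_papaya, f, rng)
```

Passing the distribution and the labeling function as arguments keeps the sampler separate from any particular choice of $\mathcal{D}$ and $f$.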

Measures of success: We define the error of a classifier to be the probability that
it does not predict the correct label on a random data point generated by the
aforementioned underlying distribution. That is, the error of $h$ is the probability
to draw a random instance $x$, according to the distribution $\mathcal{D}$, such that
$h(x)$ does not equal $f(x)$.

Formally, given a domain subset,² $A \subset \mathcal{X}$, the probability distribution, $\mathcal{D}$,
assigns a number, $\mathcal{D}(A)$, which determines how likely it is to observe a point
$x \in A$. In many cases, we refer to $A$ as an event and express it using a function
$\pi : \mathcal{X} \to \{0,1\}$, namely, $A = \{x \in \mathcal{X} : \pi(x) = 1\}$. In that case, we also use the
notation $\mathbb{P}_{x \sim \mathcal{D}}[\pi(x)]$ to express $\mathcal{D}(A)$.

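We define the error of a prediction rule, $h : \mathcal{X} \to \mathcal{Y}$, to be

\[
L_{(\mathcal{D},f)}(h) \;\stackrel{\text{def}}{=}\; \mathbb{P}_{x \sim \mathcal{D}}[h(x) \ne f(x)] \;\stackrel{\text{def}}{=}\; \mathcal{D}(\{x : h(x) \ne f(x)\}). \tag{2.1}
\]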


That is, the error of such $h$ is the probability of randomly choosing an example
$x$ for which $h(x) \ne f(x)$. The subscript $(\mathcal{D}, f)$ indicates that the error is measured
with respect to the probability distribution $\mathcal{D}$ and the correct labeling
function $f$. We omit this subscript when it is clear from the context. $L_{(\mathcal{D},f)}(h)$
has several synonymous names such as the generalization error, the risk, or
the true error of $h$, and we will use these names interchangeably throughout


1 Despite the “set” notation, $S$ is a sequence. In particular, the same example may appear twice in $S$ and
some algorithms can take into account the order of examples in $S$.

2 Strictly speaking, we should be more careful and require that $A$ is a member of some $\sigma$-algebra of
subsets of $\mathcal{X}$, over which $\mathcal{D}$ is defined. We will formally define our measurability assumptions in the
next chapter.
the book. We use the letter L for the error, since we view this error as the loss
of the learner. We will later also discuss other possible formulations of such
loss.


A note about the information available to the learner: The learner is blind to the
underlying distribution $\mathcal{D}$ over the world and to the labeling function $f$. In our
papayas example, we have just arrived on a new island and we have no clue
as to how papayas are distributed and how to predict their tastiness. The only
way the learner can interact with the environment is through observing the
training set.


In the next section we describe a simple learning paradigm for the preceding 
setup and analyze its performance. 


2.2 EMPIRICAL RISK MINIMIZATION 


As mentioned earlier, a learning algorithm receives as input a training set $S$, sampled
from an unknown distribution $\mathcal{D}$ and labeled by some target function $f$, and
should output a predictor $h_S : \mathcal{X} \to \mathcal{Y}$ (the subscript $S$ emphasizes the fact that
the output predictor depends on $S$). The goal of the algorithm is to find $h_S$ that
minimizes the error with respect to the unknown $\mathcal{D}$ and $f$.

Since the learner does not know what $\mathcal{D}$ and $f$ are, the true error is not directly
available to the learner. A useful notion of error that can be calculated by the
learner is the training error, that is, the error the classifier incurs over the training sample:


\[
L_S(h) \;\stackrel{\text{def}}{=}\; \frac{|\{i \in [m] : h(x_i) \ne y_i\}|}{m}, \tag{2.2}
\]

where $[m] = \{1,\ldots,m\}$.

The terms empirical error and empirical risk are often used interchangeably for 
this error. 
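The training error of Equation (2.2) is simply the fraction of training examples the predictor gets wrong, as the following Python sketch shows. The function name `empirical_risk`, the toy sample, and the threshold predictor are hypothetical and serve only to make the count concrete.

```python
def empirical_risk(h, S):
    """L_S(h): fraction of training examples (x_i, y_i) with h(x_i) != y_i."""
    mistakes = sum(1 for x, y in S if h(x) != y)
    return mistakes / len(S)

# Hypothetical toy sample over one-dimensional instances.
S = [(0.1, 0), (0.4, 1), (0.6, 1), (0.9, 0)]

def h(x):
    # Hypothetical predictor: label 1 iff x >= 0.5.
    return 1 if x >= 0.5 else 0

risk = empirical_risk(h, S)   # h errs on x = 0.4 and on x = 0.9
```

Unlike the true error, this quantity depends only on $S$ and $h$, so the learner can always compute it.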

Since the training sample is the snapshot of the world that is available to the
learner, it makes sense to search for a solution that works well on that data. This
learning paradigm, coming up with a predictor $h$ that minimizes $L_S(h)$, is called
Empirical Risk Minimization or ERM for short.


2.2.1 Something May Go Wrong - Overfitting 


Although the ERM rule seems very natural, without being careful, this approach 
may fail miserably. 

To demonstrate such a failure, let us go back to the problem of learning to pre- 
dict the taste of a papaya on the basis of its softness and color. Consider a sample as 
depicted in the following: 
Assume that the probability distribution D is such that instances are distributed
uniformly within the gray square and the labeling function, f, determines the label 
to be 1 if the instance is within the inner square, and 0 otherwise. The area of the 
gray square in the picture is 2 and the area of the inner square is 1. Consider the 
following predictor: 


\[
h_S(x) =
\begin{cases}
y_i & \text{if } \exists i \in [m] \text{ s.t. } x_i = x \\
0 & \text{otherwise.}
\end{cases} \tag{2.3}
\]


While this predictor might seem rather artificial, in Exercise 2.1 we show a natural
representation of it using polynomials. Clearly, no matter what the sample is,
$L_S(h_S) = 0$, and therefore this predictor may be chosen by an ERM algorithm (it is
one of the empirical-minimum-cost hypotheses; no classifier can have smaller error).
On the other hand, the true error of any classifier that predicts the label 1 only on a
finite number of instances is, in this case, 1/2. Thus, $L_{\mathcal{D}}(h_S) = 1/2$. We have found
a predictor whose performance on the training set is excellent, yet its performance 
on the true “world” is very poor. This phenomenon is called overfitting. Intuitively, 
overfitting occurs when our hypothesis fits the training data “too well” (perhaps like 
the everyday experience that a person who provides a perfect detailed explanation 
for each of his single actions may raise suspicion). 
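The gap between training error and true error in this example can be checked numerically. The sketch below is a Monte Carlo illustration under stated assumptions: the inner square is placed at the center of the gray square (the text fixes only the two areas), the true error is estimated on fresh random points rather than computed exactly, and the names `memorizer`, `draw`, and the sample sizes are our own.

```python
import math
import random

SIDE = math.sqrt(2.0)          # gray square [0, SIDE]^2 has area 2
LO = (SIDE - 1.0) / 2.0        # assumed centered inner square [LO, LO + 1]^2, area 1

def f(x):
    """Label 1 inside the inner square, 0 otherwise."""
    return 1 if all(LO <= c <= LO + 1.0 for c in x) else 0

def draw(rng):
    """Uniform point in the gray square."""
    return (rng.uniform(0.0, SIDE), rng.uniform(0.0, SIDE))

def memorizer(S):
    """Predictor of Equation (2.3): recall stored labels, otherwise predict 0."""
    table = dict(S)
    return lambda x: table.get(x, 0)

rng = random.Random(0)
S = [(x, f(x)) for x in (draw(rng) for _ in range(200))]
h_S = memorizer(S)

train_err = sum(h_S(x) != y for x, y in S) / len(S)
fresh = [draw(rng) for _ in range(20000)]
true_err_est = sum(h_S(x) != f(x) for x in fresh) / len(fresh)
```

The training error is exactly zero, while the estimate on fresh points is close to 1/2, because a fresh point almost surely does not coincide with any training point and half of the gray square carries the label 1.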


2.3 EMPIRICAL RISK MINIMIZATION WITH INDUCTIVE BIAS 


We have just demonstrated that the ERM rule might lead to overfitting. Rather 
than giving up on the ERM paradigm, we will look for ways to rectify it. We will 
search for conditions under which there is a guarantee that ERM does not overfit, 
namely, conditions under which when the ERM predictor has good performance 
with respect to the training data, it is also highly likely to perform well over the 
underlying data distribution. 

A common solution is to apply the ERM learning rule over a restricted search 
space. Formally, the learner should choose in advance (before seeing the data) a set 
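of predictors. This set is called a hypothesis class and is denoted by $\mathcal{H}$. Each $h \in \mathcal{H}$
is a function mapping from $\mathcal{X}$ to $\mathcal{Y}$. For a given class $\mathcal{H}$, and a training sample, $S$,
the $\mathrm{ERM}_{\mathcal{H}}$ learner uses the ERM rule to choose a predictor $h \in \mathcal{H}$, with the lowest
possible error over $S$. Formally,

\[
\mathrm{ERM}_{\mathcal{H}}(S) \in \operatorname*{argmin}_{h \in \mathcal{H}} L_S(h),
\]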


where argmin stands for the set of hypotheses in H that achieve the minimum value 
of Ls(h) over H. By restricting the learner to choosing a predictor from H, we bias it 
toward a particular set of predictors. Such restrictions are often called an inductive 
bias. Since the choice of such a restriction is determined before the learner sees the 
training data, it should ideally be based on some prior knowledge about the problem 
to be learned. For example, for the papaya taste prediction problem we may choose 
the class H to be the set of predictors that are determined by axis aligned rectangles 
(in the space determined by the color and softness coordinates). We will later show
that $\mathrm{ERM}_{\mathcal{H}}$ over this class is guaranteed not to overfit. On the other hand, the
example of overfitting that we have seen previously demonstrates that choosing $\mathcal{H}$
to be a class of predictors that includes all functions that assign the value 1 to a finite
set of domain points does not suffice to guarantee that $\mathrm{ERM}_{\mathcal{H}}$ will not overfit.

A fundamental question in learning theory is, over which hypothesis classes
$\mathrm{ERM}_{\mathcal{H}}$ learning will not result in overfitting. We will study this question later in
the book.

Intuitively, choosing a more restricted hypothesis class better protects us against 
overfitting but at the same time might cause us a stronger inductive bias. We will get 
back to this fundamental tradeoff later. 


2.3.1 Finite Hypothesis Classes 


The simplest type of restriction on a class is imposing an upper bound on its size
(that is, the number of predictors $h$ in $\mathcal{H}$). In this section, we show that if $\mathcal{H}$ is a
finite class then $\mathrm{ERM}_{\mathcal{H}}$ will not overfit, provided it is based on a sufficiently large
training sample (this size requirement will depend on the size of $\mathcal{H}$).

Limiting the learner to prediction rules within some finite hypothesis class may
be considered as a reasonably mild restriction. For example, $\mathcal{H}$ can be the set of all
predictors that can be implemented by a C++ program written in at most $10^9$ bits
of code. In our papayas example, we mentioned previously the class of axis aligned
rectangles. While this is an infinite class, if we discretize the representation of real
numbers, say, by using a 64-bit floating-point representation, the hypothesis class
becomes a finite class.
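For a finite class, the ERM rule can be carried out by exhaustive search. The Python sketch below illustrates this on a deliberately simple class that is not from the text: one-dimensional threshold predictors whose threshold is restricted to a grid of 101 values, which plays the role of the discretization mentioned above. The names `erm` and `make_threshold` and the toy sample are hypothetical.

```python
def empirical_risk(h, S):
    """Fraction of examples in S that h labels incorrectly."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(hypotheses, S):
    """Return one hypothesis in the finite class with minimal empirical risk."""
    return min(hypotheses, key=lambda h: empirical_risk(h, S))

def make_threshold(t):
    # Hypothetical predictor h_t(x) = 1 iff x >= t.
    return lambda x: 1 if x >= t else 0

# Finite class: thresholds restricted to a grid of 101 values in [0, 1].
grid = [k / 100 for k in range(101)]
H = [make_threshold(t) for t in grid]

# Toy sample labeled by the threshold 0.5, which belongs to the class.
S = [(0.05, 0), (0.30, 0), (0.48, 0), (0.52, 1), (0.70, 1), (0.95, 1)]
h_S = erm(H, S)
```

Several thresholds between 0.48 and 0.52 achieve zero training error here; `min` returns the first of them, which matches the fact that the argmin is a set and ERM may return any of its members.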

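Let us now analyze the performance of the $\mathrm{ERM}_{\mathcal{H}}$ learning rule assuming that
$\mathcal{H}$ is a finite class. For a training sample, $S$, labeled according to some $f : \mathcal{X} \to \mathcal{Y}$, let
$h_S$ denote a result of applying $\mathrm{ERM}_{\mathcal{H}}$ to $S$, namely,

\[
h_S \in \operatorname*{argmin}_{h \in \mathcal{H}} L_S(h). \tag{2.4}
\]
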
In this chapter, we make the following simplifying assumption (which will be 
relaxed in the next chapter). 


Definition 2.1 (The Realizability Assumption). There exists $h^* \in \mathcal{H}$ s.t.
$L_{(\mathcal{D},f)}(h^*) = 0$. Note that this assumption implies that with probability 1 over random
samples, $S$, where the instances of $S$ are sampled according to $\mathcal{D}$ and are labeled
by $f$, we have $L_S(h^*) = 0$.


The realizability assumption implies that for every ERM hypothesis we have
that³ $L_S(h_S) = 0$. However, we are interested in the true risk of $h_S$, $L_{(\mathcal{D},f)}(h_S)$,
rather than its empirical risk.

Clearly, any guarantee on the error with respect to the underlying distribution, 
D, for an algorithm that has access only to a sample S should depend on the rela- 
tionship between D and S. The common assumption in statistical machine learning 
is that the training sample S is generated by sampling points from the distribution D 
independently of each other. Formally, 


3 Mathematically speaking, this holds with probability 1. To simplify the presentation, we sometimes 
omit the “with probability 1” specifier. 
The i.i.d. assumption: The examples in the training set are independently and
identically distributed (i.i.d.) according to the distribution $\mathcal{D}$. That is, every
$x_i$ in $S$ is freshly sampled according to $\mathcal{D}$ and then labeled according to the
labeling function, $f$. We denote this assumption by $S \sim \mathcal{D}^m$ where $m$ is the
size of $S$, and $\mathcal{D}^m$ denotes the probability over m-tuples induced by applying $\mathcal{D}$
to pick each element of the tuple independently of the other members of the
tuple.

Intuitively, the training set S is a window through which the learner gets 
partial information about the distribution D over the world and the labeling 
function, f. The larger the sample gets, the more likely it is to reflect more 
accurately the distribution and labeling used to generate it. 


Since $L_{(\mathcal{D},f)}(h_S)$ depends on the training set, $S$, and that training set is picked by
a random process, there is randomness in the choice of the predictor $h_S$ and, consequently,
in the risk $L_{(\mathcal{D},f)}(h_S)$. Formally, we say that it is a random variable. It is not
realistic to expect that with full certainty $S$ will suffice to direct the learner toward
a good classifier (from the point of view of $\mathcal{D}$), as there is always some probability
that the sampled training data happens to be very nonrepresentative of the underlying
$\mathcal{D}$. If we go back to the papaya tasting example, there is always some (small)
chance that all the papayas we have happened to taste were not tasty, in spite of the
fact that, say, 70% of the papayas on our island are tasty. In such a case, $\mathrm{ERM}_{\mathcal{H}}(S)$
may be the constant function that labels every papaya as “not tasty” (and has 70%
error on the true distribution of papayas on the island). We will therefore address
the probability to sample a training set for which $L_{(\mathcal{D},f)}(h_S)$ is not too large. Usually,
we denote the probability of getting a nonrepresentative sample by $\delta$, and call
$(1 - \delta)$ the confidence parameter of our prediction.

On top of that, since we cannot guarantee perfect label prediction, we introduce
another parameter for the quality of prediction, the accuracy parameter, commonly
denoted by $\epsilon$. We interpret the event $L_{(\mathcal{D},f)}(h_S) > \epsilon$ as a failure of the learner, while
if $L_{(\mathcal{D},f)}(h_S) \le \epsilon$ we view the output of the algorithm as an approximately correct
predictor. Therefore (fixing some labeling function $f : \mathcal{X} \to \mathcal{Y}$), we are interested
in upper bounding the probability to sample an m-tuple of instances that will lead to
failure of the learner. Formally, let $S|_x = (x_1,\ldots,x_m)$ be the instances of the training
set. We would like to upper bound

\[
\mathcal{D}^m(\{S|_x : L_{(\mathcal{D},f)}(h_S) > \epsilon\}).
\]
Let $\mathcal{H}_B$ be the set of “bad” hypotheses, that is,
\[
\mathcal{H}_B = \{h \in \mathcal{H} : L_{(\mathcal{D},f)}(h) > \epsilon\}.
\]
In addition, let
\[
M = \{S|_x : \exists h \in \mathcal{H}_B,\ L_S(h) = 0\}
\]


be the set of misleading samples: Namely, for every $S|_x \in M$, there is a “bad” hypothesis,
$h \in \mathcal{H}_B$, that looks like a “good” hypothesis on $S|_x$. Now, recall that we would
like to bound the probability of the event $L_{(\mathcal{D},f)}(h_S) > \epsilon$. But, since the realizability
assumption implies that $L_S(h_S) = 0$, it follows that the event $L_{(\mathcal{D},f)}(h_S) > \epsilon$ can
only happen if for some $h \in \mathcal{H}_B$ we have $L_S(h) = 0$. In other words, this event will
only happen if our sample is in the set of misleading samples, $M$. Formally, we have
shown that

\[
\{S|_x : L_{(\mathcal{D},f)}(h_S) > \epsilon\} \subseteq M.
\]


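Note that we can rewrite $M$ as

\[
M = \bigcup_{h \in \mathcal{H}_B} \{S|_x : L_S(h) = 0\}. \tag{2.5}
\]

Hence,
\[
\mathcal{D}^m(\{S|_x : L_{(\mathcal{D},f)}(h_S) > \epsilon\}) \le \mathcal{D}^m(M) = \mathcal{D}^m\Big(\bigcup_{h \in \mathcal{H}_B} \{S|_x : L_S(h) = 0\}\Big). \tag{2.6}
\]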


Next, we upper bound the right-hand side of the preceding equation using the 
union bound — a basic property of probabilities. 


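Lemma 2.2 (Union Bound). For any two sets $A$, $B$ and a distribution $\mathcal{D}$ we have
\[
\mathcal{D}(A \cup B) \le \mathcal{D}(A) + \mathcal{D}(B).
\]
Applying the union bound to the right-hand side of Equation (2.6) yields

\[
\mathcal{D}^m(\{S|_x : L_{(\mathcal{D},f)}(h_S) > \epsilon\}) \le \sum_{h \in \mathcal{H}_B} \mathcal{D}^m(\{S|_x : L_S(h) = 0\}). \tag{2.7}
\]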


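Next, let us bound each summand of the right-hand side of the preceding inequality.
Fix some “bad” hypothesis $h \in \mathcal{H}_B$. The event $L_S(h) = 0$ is equivalent to the event
$\forall i,\ h(x_i) = f(x_i)$. Since the examples in the training set are sampled i.i.d. we get that

\[
\begin{aligned}
\mathcal{D}^m(\{S|_x : L_S(h) = 0\}) &= \mathcal{D}^m(\{S|_x : \forall i,\ h(x_i) = f(x_i)\}) \\
&= \prod_{i=1}^m \mathcal{D}(\{x_i : h(x_i) = f(x_i)\}).
\end{aligned} \tag{2.8}
\]

For each individual sampling of an element of the training set we have
\[
\mathcal{D}(\{x_i : h(x_i) = y_i\}) = 1 - L_{(\mathcal{D},f)}(h) \le 1 - \epsilon,
\]

where the last inequality follows from the fact that $h \in \mathcal{H}_B$. Combining the previous
equation with Equation (2.8) and using the inequality $1 - \epsilon \le e^{-\epsilon}$ we obtain that for
every $h \in \mathcal{H}_B$,

\[
\mathcal{D}^m(\{S|_x : L_S(h) = 0\}) \le (1 - \epsilon)^m \le e^{-\epsilon m}. \tag{2.9}
\]
Combining this equation with Equation (2.7) we conclude that
\[
\mathcal{D}^m(\{S|_x : L_{(\mathcal{D},f)}(h_S) > \epsilon\}) \le |\mathcal{H}_B|\, e^{-\epsilon m} \le |\mathcal{H}|\, e^{-\epsilon m}.
\]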
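The last bound can be turned into a requirement on the sample size. As a short worked step (our own rearrangement, using only the inequality just derived), asking that the failure probability be at most some $\delta \in (0,1)$ gives:

```latex
|\mathcal{H}|\, e^{-\epsilon m} \le \delta
\;\iff\; e^{\epsilon m} \ge \frac{|\mathcal{H}|}{\delta}
\;\iff\; \epsilon m \ge \log\!\left(\frac{|\mathcal{H}|}{\delta}\right)
\;\iff\; m \ge \frac{\log(|\mathcal{H}|/\delta)}{\epsilon}.
```

So the required number of examples grows only logarithmically in $|\mathcal{H}|$ and in $1/\delta$, and linearly in $1/\epsilon$.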


A graphical illustration which explains how we used the union bound is given in 
Figure 2.1. 


Figure 2.1. Each point in the large circle represents a possible m-tuple of instances. Each
colored oval represents the set of “misleading” m-tuples of instances for some “bad” predictor
$h \in \mathcal{H}_B$. The ERM can potentially overfit whenever it gets a misleading training set
$S$. That is, for some $h \in \mathcal{H}_B$ we have $L_S(h) = 0$. Equation (2.9) guarantees that for each
individual bad hypothesis, $h \in \mathcal{H}_B$, at most a $(1-\epsilon)^m$-fraction of the training sets would be
misleading. In particular, the larger $m$ is, the smaller each of these colored ovals becomes.
The union bound formalizes the fact that the area representing the training sets that are
misleading with respect to some $h \in \mathcal{H}_B$ (that is, the training sets in $M$) is at most the
sum of the areas of the colored ovals. Therefore, it is bounded by $|\mathcal{H}_B|$ times the maximum
size of a colored oval. Any sample $S$ outside the colored ovals cannot cause the ERM rule
to overfit.


Corollary 2.3. Let $\mathcal{H}$ be a finite hypothesis class. Let $\delta \in (0,1)$ and $\epsilon > 0$ and let $m$ be
an integer that satisfies
\[
m \ge \frac{\log(|\mathcal{H}|/\delta)}{\epsilon}.
\]
Then, for any labeling function, $f$, and for any distribution, $\mathcal{D}$, for which the realizability
assumption holds (that is, for some $h \in \mathcal{H}$, $L_{(\mathcal{D},f)}(h) = 0$), with probability of
at least $1 - \delta$ over the choice of an i.i.d. sample $S$ of size $m$, we have that for every
ERM hypothesis, $h_S$, it holds that

\[
L_{(\mathcal{D},f)}(h_S) \le \epsilon.
\]


The preceding corollary tells us that for a sufficiently large $m$, the $\mathrm{ERM}_{\mathcal{H}}$ rule
over a finite hypothesis class will be probably (with confidence $1 - \delta$) approximately
(up to an error of $\epsilon$) correct. In the next chapter we formally define the model of
Probably Approximately Correct (PAC) learning.
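To get a feel for the numbers, the sample size required by Corollary 2.3 can be evaluated directly. The Python sketch below is only an illustration: the function names `sample_complexity` and `failure_bound` and the particular values of $|\mathcal{H}|$, $\epsilon$, and $\delta$ are our own.

```python
import math

def sample_complexity(h_size, eps, delta):
    """Smallest integer m with m >= log(|H| / delta) / eps (natural logarithm)."""
    return math.ceil(math.log(h_size / delta) / eps)

def failure_bound(h_size, eps, m):
    """Upper bound |H| * exp(-eps * m) on the probability that the true error exceeds eps."""
    return h_size * math.exp(-eps * m)

# Hypothetical setting: 1000 hypotheses, accuracy 0.1, confidence 0.95.
m = sample_complexity(h_size=1000, eps=0.1, delta=0.05)
bound_at_m = failure_bound(1000, 0.1, m)
```

With these values, one hundred examples suffice, and multiplying $|\mathcal{H}|$ by a thousand would add only about $\log(1000)/\epsilon \approx 69$ more.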


2.4 EXERCISES 


2.1 Overfitting of polynomial matching: We have shown that the predictor defined in
Equation (2.3) leads to overfitting. While this predictor seems to be very unnatural,
the goal of this exercise is to show that it can be described as a thresholded polynomial.
That is, show that given a training set $S = \{(x_i, f(x_i))\}_{i=1}^m \subseteq (\mathbb{R}^d \times \{0,1\})^m$,
there exists a polynomial $p_S$ such that $h_S(x) = 1$ if and only if $p_S(x) \ge 0$, where $h_S$
is as defined in Equation (2.3). It follows that learning the class of all thresholded
polynomials using the ERM rule may lead to overfitting.

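2.2 Let $\mathcal{H}$ be a class of binary classifiers over a domain $\mathcal{X}$. Let $\mathcal{D}$ be an unknown distribution
over $\mathcal{X}$, and let $f$ be the target hypothesis in $\mathcal{H}$. Fix some $h \in \mathcal{H}$. Show that
the expected value of $L_S(h)$ over the choice of $S|_x$ equals $L_{(\mathcal{D},f)}(h)$, namely,

\[
\mathbb{E}_{S|_x \sim \mathcal{D}^m}[L_S(h)] = L_{(\mathcal{D},f)}(h).
\]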


2.3 Axis aligned rectangles: An axis aligned rectangle classifier in the plane is a classi- 
fier that assigns the value 1 to a point if and only if it is inside a certain rectangle. 
Figure 2.2. Axis aligned rectangles.

Formally, given real numbers $a_1 \le b_1$, $a_2 \le b_2$, define the classifier $h_{(a_1,b_1,a_2,b_2)}$ by

\[
h_{(a_1,b_1,a_2,b_2)}(x_1, x_2) =
\begin{cases}
1 & \text{if } a_1 \le x_1 \le b_1 \text{ and } a_2 \le x_2 \le b_2 \\
0 & \text{otherwise}
\end{cases} \tag{2.10}
\]

The class of all axis aligned rectangles in the plane is defined as
\[
\mathcal{H}^2_{\mathrm{rec}} = \{h_{(a_1,b_1,a_2,b_2)} : a_1 \le b_1, \text{ and } a_2 \le b_2\}.
\]


Note that this is an infinite size hypothesis class. Throughout this exercise we rely
on the realizability assumption.

1. Let $A$ be the algorithm that returns the smallest rectangle enclosing all positive
examples in the training set. Show that $A$ is an ERM.

2. Show that if $A$ receives a training set of size $\ge \frac{4\log(4/\delta)}{\epsilon}$ then, with probability of
at least $1 - \delta$ it returns a hypothesis with error of at most $\epsilon$.
Hint: Fix some distribution $\mathcal{D}$ over $\mathcal{X}$, let $R^* = R(a_1^*, b_1^*, a_2^*, b_2^*)$ be the rectangle
that generates the labels, and let $f$ be the corresponding hypothesis. Let
$a_1 \ge a_1^*$ be a number such that the probability mass (with respect to $\mathcal{D}$) of the
rectangle $R_1 = R(a_1^*, a_1, a_2^*, b_2^*)$ is exactly $\epsilon/4$. Similarly, let $b_1, a_2, b_2$ be numbers
such that the probability masses of the rectangles $R_2 = R(b_1, b_1^*, a_2^*, b_2^*)$, $R_3 =
R(a_1^*, b_1^*, a_2^*, a_2)$, $R_4 = R(a_1^*, b_1^*, b_2, b_2^*)$ are all exactly $\epsilon/4$. Let $R(S)$ be the
rectangle returned by $A$. See illustration in Figure 2.2.
■ Show that $R(S) \subseteq R^*$.
■ Show that if $S$ contains (positive) examples in all of the rectangles
$R_1, R_2, R_3, R_4$, then the hypothesis returned by $A$ has error of at most $\epsilon$.
■ For each $i \in \{1,\ldots,4\}$, upper bound the probability that $S$ does not contain
an example from $R_i$.
■ Use the union bound to conclude the argument.

3. Repeat the previous question for the class of axis aligned rectangles in $\mathbb{R}^d$.

4. Show that the runtime of applying the algorithm $A$ mentioned earlier is polynomial
in $d$, $1/\epsilon$, and in $\log(1/\delta)$.
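For readers who want to experiment with Exercise 2.3, the algorithm $A$ in the plane can be sketched as follows. This is one possible rendering, not a solution to the exercise: the function names are ours, the sample is hypothetical, and returning `None` (a predictor that labels everything 0) when the sample has no positive example is our own convention, since the exercise does not specify that case.

```python
def tightest_rectangle(S):
    """Smallest axis aligned rectangle enclosing all positive examples.

    S is a list of ((x1, x2), y) pairs. Returns (a1, b1, a2, b2), or None when
    S contains no positive example (assumed convention: predict 0 everywhere).
    """
    positives = [x for x, y in S if y == 1]
    if not positives:
        return None
    xs1 = [p[0] for p in positives]
    xs2 = [p[1] for p in positives]
    return (min(xs1), max(xs1), min(xs2), max(xs2))

def predict(rect, x):
    """The classifier of Equation (2.10) for the rectangle rect."""
    if rect is None:
        return 0
    a1, b1, a2, b2 = rect
    return 1 if a1 <= x[0] <= b1 and a2 <= x[1] <= b2 else 0

# Hypothetical sample labeled by the rectangle [1, 3] x [1, 3].
S = [((2.0, 2.0), 1), ((1.5, 2.5), 1), ((0.5, 2.0), 0),
     ((2.0, 3.5), 0), ((2.8, 1.2), 1)]
rect = tightest_rectangle(S)
```

The algorithm makes a single pass over the sample and keeps four running extremes, which is a useful starting point for the runtime question in item 4.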
In this chapter we define our main formal learning model, the PAC learning model,
and its extensions. We will consider other notions of learnability in Chapter 7.


3.1 PAC LEARNING 


In the previous chapter we have shown that for a finite hypothesis class, if the ERM
rule with respect to that class is applied on a sufficiently large training sample (whose
size is independent of the underlying distribution or labeling function) then the output
hypothesis will be probably approximately correct. More generally, we now
define Probably Approximately Correct (PAC) learning.


Definition 3.1 (PAC Learnability). A hypothesis class $\mathcal{H}$ is PAC learnable if there
exist a function $m_{\mathcal{H}} : (0,1)^2 \to \mathbb{N}$ and a learning algorithm with the following property:
For every $\epsilon, \delta \in (0,1)$, for every distribution $\mathcal{D}$ over $\mathcal{X}$, and for every labeling
function $f : \mathcal{X} \to \{0,1\}$, if the realizable assumption holds with respect to $\mathcal{H}, \mathcal{D}, f$,
then when running the learning algorithm on $m \ge m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples generated
by $\mathcal{D}$ and labeled by $f$, the algorithm returns a hypothesis $h$ such that, with
probability of at least $1 - \delta$ (over the choice of the examples), $L_{(\mathcal{D},f)}(h) \le \epsilon$.


The definition of Probably Approximately Correct learnability contains two
approximation parameters. The accuracy parameter $\epsilon$ determines how far the output
classifier can be from the optimal one (this corresponds to the “approximately
correct”), and the confidence parameter $\delta$ indicates how likely the classifier is to meet
that accuracy requirement (this corresponds to the “probably” part of “PAC”). Under
the data access model that we are investigating, these approximations are inevitable.
Since the training set is randomly generated, there may always be a small chance that 
it will happen to be noninformative (for example, there is always some chance that 
the training set will contain only one domain point, sampled over and over again). 
Furthermore, even when we are lucky enough to get a training sample that does 
faithfully represent D, because it is just a finite sample, there may always be some 
fine details of $\mathcal{D}$ that it fails to reflect. Our accuracy parameter, $\epsilon$, allows “forgiving”
the learner’s classifier for making minor errors.
3.2.1 Releasing the Realizability Assumption - Agnostic PAC Learning


A More Realistic Model for the Data-Generating Distribution 

Recall that the realizability assumption requires that there exists $h^* \in \mathcal{H}$ such that
$\mathbb{P}_{x \sim \mathcal{D}}[h^*(x) = f(x)] = 1$. In many practical problems this assumption does not hold.
Furthermore, it is perhaps more realistic not to assume that the labels are fully determined
by the features we measure on input elements (in the case of the papayas,
it is plausible that two papayas of the same color and softness will have different
taste). In the following, we relax the realizability assumption by replacing the
“target labeling function” with a more flexible notion, a data-labels generating
distribution.

Formally, from now on, let $\mathcal{D}$ be a probability distribution over $\mathcal{X} \times \mathcal{Y}$, where,
as before, $\mathcal{X}$ is our domain set and $\mathcal{Y}$ is a set of labels (usually we will consider
$\mathcal{Y} = \{0, 1\}$). That is, $\mathcal{D}$ is a joint distribution over domain points and labels. One can
view such a distribution as being composed of two parts: a distribution $\mathcal{D}_x$ over unlabeled
domain points (sometimes called the marginal distribution) and a conditional
probability over labels for each domain point, $\mathcal{D}((x, y)|x)$. In the papaya example,
$\mathcal{D}_x$ determines the probability of encountering a papaya whose color and hardness
fall in some color-hardness values domain, and the conditional probability is the
probability that a papaya with color and hardness represented by $x$ is tasty. Indeed,
such modeling allows for two papayas that share the same color and hardness to
belong to different taste categories.
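The two-part view of the joint distribution translates directly into a two-step sampler, sketched below in Python. The marginal (uniform color and hardness) and the conditional probabilities (0.9 and 0.2) are invented for illustration, as are all function names; the point is only that the label is drawn at random given $x$, so the same $x$ can receive different labels.

```python
import random

def draw_instance(rng):
    """Marginal D_x (hypothetical): color and hardness uniform on [0, 1]."""
    return (rng.random(), rng.random())

def prob_tasty(x):
    """Conditional probability that a papaya with features x is tasty (hypothetical)."""
    color, hardness = x
    return 0.9 if color > 0.5 else 0.2

def draw_label(x, rng):
    """Draw y given x: 1 with probability prob_tasty(x), else 0."""
    return 1 if rng.random() < prob_tasty(x) else 0

def draw_example(rng):
    """Draw (x, y) from the joint distribution: x from D_x, then y given x."""
    x = draw_instance(rng)
    return x, draw_label(x, rng)

rng = random.Random(0)
x = (0.8, 0.3)
labels = [draw_label(x, rng) for _ in range(1000)]   # same x, repeated draws
frac_tasty = sum(labels) / len(labels)
```

Repeated draws for the same features produce both labels, so no deterministic labeling function can describe this source exactly.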


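The Empirical and the True Error Revised

For a probability distribution, $\mathcal{D}$, over $\mathcal{X} \times \mathcal{Y}$, one can measure how likely $h$ is to
make an error when labeled points are randomly drawn according to $\mathcal{D}$. We redefine
the true error (or risk) of a prediction rule $h$ to be

\[
L_{\mathcal{D}}(h) \;\stackrel{\text{def}}{=}\; \mathbb{P}_{(x,y) \sim \mathcal{D}}[h(x) \ne y] \;\stackrel{\text{def}}{=}\; \mathcal{D}(\{(x, y) : h(x) \ne y\}). \tag{3.1}
\]
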
We would like to find a predictor, $h$, for which that error will be minimized.
However, the learner does not know the data generating $\mathcal{D}$. What the learner does
have access to is the training data, $S$. The definition of the empirical risk remains
the same as before, namely,
\[
L_S(h) \;\stackrel{\text{def}}{=}\; \frac{|\{i \in [m] : h(x_i) \ne y_i\}|}{m}.
\]

Given $S$, a learner can compute $L_S(h)$ for any function $h : \mathcal{X} \to \{0,1\}$. Note that
$L_S(h) = L_{\mathcal{D}(\text{uniform over } S)}(h)$.


The Goal 
We wish to find some hypothesis, $h : \mathcal{X} \to \mathcal{Y}$, that (probably approximately)
minimizes the true risk, $L_{\mathcal{D}}(h)$.
will be able to classify given documents according to topics (e.g., news, sports,
biology, medicine). A learning algorithm for such a task will have access to 
examples of correctly classified documents and, on the basis of these examples, 
should output a program that can take as input a new document and output 
a topic classification for that document. Here, the domain set is the set of all 
potential documents. Once again, we would usually represent documents by a 
set of features that could include counts of different key words in the document, 
as well as other possibly relevant features like the size of the document or its ori- 
gin. The label set in this task will be the set of possible document topics (so Y will 
be some large finite set). Once we determine our domain and label sets, the other
components of our framework look exactly the same as in the papaya tasting
example: our training sample will be a finite sequence of (feature vector, label)
pairs, the learner’s output will be a function from the domain set to the label
set, and, finally, for our measure of success, we can use the probability, over
(document, topic) pairs, of the event that our predictor suggests a wrong label.

■ Regression In this task, one wishes to find some simple pattern in the data: a
functional relationship between the $\mathcal{X}$ and $\mathcal{Y}$ components of the data. For example,
one wishes to find a linear function that best predicts a baby’s birth weight
on the basis of ultrasound measures of his head circumference, abdominal circumference,
and femur length. Here, our domain set $\mathcal{X}$ is some subset of $\mathbb{R}^3$ (the
three ultrasound measurements), and the set of “labels,” $\mathcal{Y}$, is the set of real
numbers (the weight in grams). In this context, it is more adequate to call $\mathcal{Y}$ the
target set. Our training data as well as the learner’s output are as before (a finite
sequence of $(x, y)$ pairs, and a function from $\mathcal{X}$ to $\mathcal{Y}$ respectively). However,
our measure of success is different. We may evaluate the quality of a hypothesis
function, $h : \mathcal{X} \to \mathcal{Y}$, by the expected square difference between the true labels
and their predicted values, namely,


Lothy = 


h (A(x) —y). 3.2 
cE pi@)-») (3.2) 

To accommodate a wide range of learning tasks we generalize our formalism of 
the measure of success as follows: 


Generalized Loss Functions 

Given any set $\mathcal{H}$ (that plays the role of our hypotheses, or models) and some domain
$Z$, let $\ell$ be any function from $\mathcal{H} \times Z$ to the set of nonnegative real numbers,
$\ell : \mathcal{H} \times Z \to \mathbb{R}_+$. We call such functions loss functions.

Note that for prediction problems, we have that $Z = \mathcal{X} \times \mathcal{Y}$. However, our notion
of the loss function is generalized beyond prediction tasks, and therefore it allows
$Z$ to be any domain of examples (for instance, in unsupervised learning tasks such
as the one described in Chapter 22, $Z$ is not a product of an instance domain and a
label domain).

We now define the risk function to be the expected loss of a classifier, $h \in \mathcal{H}$, with
respect to a probability distribution $\mathcal{D}$ over $Z$, namely,

$$L_{\mathcal{D}}(h) \stackrel{\text{def}}{=} \mathop{\mathbb{E}}_{z\sim\mathcal{D}} [\ell(h, z)]. \tag{3.3}$$

That is, we consider the expectation of the loss of $h$ over objects $z$ picked randomly
according to $\mathcal{D}$. Similarly, we define the empirical risk to be the average loss
over a given sample $S = (z_1, \ldots, z_m) \in Z^m$, namely,

$$L_S(h) \stackrel{\text{def}}{=} \frac{1}{m} \sum_{i=1}^{m} \ell(h, z_i). \tag{3.4}$$


The loss functions used in the preceding examples of classification and regression 
tasks are as follows: 


• 0-1 loss: Here, our random variable $z$ ranges over the set of pairs $\mathcal{X} \times \mathcal{Y}$ and the
loss function is

$$\ell_{0-1}(h, (x, y)) \stackrel{\text{def}}{=} \begin{cases} 0 & \text{if } h(x) = y \\ 1 & \text{if } h(x) \neq y \end{cases}$$

This loss function is used in binary or multiclass classification problems.

One should note that, for a random variable, $\alpha$, taking the values $\{0, 1\}$,
$\mathbb{E}_{\alpha\sim\mathcal{D}}[\alpha] = \mathbb{P}_{\alpha\sim\mathcal{D}}[\alpha = 1]$. Consequently, for this loss function, the definitions
of $L_{\mathcal{D}}(h)$ given in Equation (3.3) and Equation (3.1) coincide.

• Square Loss: Here, our random variable $z$ ranges over the set of pairs $\mathcal{X} \times \mathcal{Y}$ and
the loss function is

$$\ell_{sq}(h, (x, y)) \stackrel{\text{def}}{=} (h(x) - y)^2.$$


This loss function is used in regression problems. 
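
As a small illustration of Equation (3.4), the following Python sketch computes the empirical risk of a fixed hypothesis under the two losses above. The function names (`zero_one_loss`, `square_loss`, `empirical_risk`), the two toy predictors, and the toy sample are assumptions made for this example only; they are not notation from the text.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[float, float]          # a pair z = (x, y)
Hypothesis = Callable[[float], float]  # a predictor h


def zero_one_loss(h: Hypothesis, z: Example) -> float:
    """0-1 loss: 0 if h(x) equals y, 1 otherwise."""
    x, y = z
    return 0.0 if h(x) == y else 1.0


def square_loss(h: Hypothesis, z: Example) -> float:
    """Square loss: (h(x) - y)^2."""
    x, y = z
    return (h(x) - y) ** 2


def empirical_risk(loss, h: Hypothesis, sample: Sequence[Example]) -> float:
    """Average loss of h over the sample, as in Equation (3.4)."""
    return sum(loss(h, z) for z in sample) / len(sample)


# Hypothetical toy data: four (x, y) pairs.
sample = [(0.1, 0.0), (0.4, 1.0), (0.6, 1.0), (0.9, 1.0)]

# A threshold classifier, evaluated with the 0-1 loss.
threshold = lambda x: 1.0 if x >= 0.5 else 0.0
risk_01 = empirical_risk(zero_one_loss, threshold, sample)

# The identity map as a regressor, evaluated with the square loss.
identity = lambda x: x
risk_sq = empirical_risk(square_loss, identity, sample)
```

The same `empirical_risk` routine serves both tasks; only the loss function passed to it changes, which is the point of the generalized formalism.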


We will later see more examples of useful instantiations of loss functions. 
To summarize, we formally define agnostic PAC learnability for general loss 
functions. 


Definition 3.4 (Agnostic PAC Learnability for General Loss Functions). A hypothesis
class $\mathcal{H}$ is agnostic PAC learnable with respect to a set $Z$ and a loss function
$\ell : \mathcal{H} \times Z \to \mathbb{R}_+$, if there exist a function $m_{\mathcal{H}} : (0,1)^2 \to \mathbb{N}$ and a learning algorithm
with the following property: For every $\epsilon, \delta \in (0,1)$ and for every distribution $\mathcal{D}$ over
$Z$, when running the learning algorithm on $m \ge m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples generated
by $\mathcal{D}$, the algorithm returns $h \in \mathcal{H}$ such that, with probability of at least $1 - \delta$ (over
the choice of the $m$ training examples),

$$L_{\mathcal{D}}(h) \le \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon,$$

where $L_{\mathcal{D}}(h) = \mathbb{E}_{z\sim\mathcal{D}}[\ell(h, z)]$.


Remark 3.1 (A Note About Measurability*). In the aforementioned definition, for
every $h \in \mathcal{H}$, we view the function $\ell(h, \cdot) : Z \to \mathbb{R}_+$ as a random variable and define
$L_{\mathcal{D}}(h)$ to be the expected value of this random variable. For that, we need to require
that the function $\ell(h, \cdot)$ is measurable. Formally, we assume that there is a $\sigma$-algebra
of subsets of $Z$, over which the probability $\mathcal{D}$ is defined, and that the preimage
of every initial segment in $\mathbb{R}_+$ is in this $\sigma$-algebra. In the specific case of binary
classification with the 0-1 loss, the $\sigma$-algebra is over $\mathcal{X} \times \{0,1\}$ and our assumption
on $\ell$ is equivalent to the assumption that for every $h$, the set $\{(x, h(x)) : x \in \mathcal{X}\}$ is in
the $\sigma$-algebra.

Remark 3.2 (Proper vs. Representation-Independent Learning*). In the preceding
definition, we required that the algorithm will return a hypothesis from $\mathcal{H}$. In
some situations, $\mathcal{H}$ is a subset of a set $\mathcal{H}'$, and the loss function can be naturally
extended to be a function from $\mathcal{H}' \times Z$ to the reals. In this case, we may allow
the algorithm to return a hypothesis $h' \in \mathcal{H}'$, as long as it satisfies the requirement
$L_{\mathcal{D}}(h') \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon$. Allowing the algorithm to output a hypothesis from
$\mathcal{H}'$ is called representation independent learning, while proper learning occurs when
the algorithm must output a hypothesis from $\mathcal{H}$. Representation independent learning
is sometimes called “improper learning,” although there is nothing improper in
representation independent learning.


3.3 SUMMARY 


In this chapter we defined our main formal learning model — PAC learning. The 
basic model relies on the realizability assumption, while the agnostic variant does 
not impose any restrictions on the underlying distribution over the examples. We 
also generalized the PAC model to arbitrary loss functions. We will sometimes refer 
to the most general model simply as PAC learning, omitting the “agnostic” prefix 
and letting the reader infer what the underlying loss function is from the context. 
When we would like to emphasize that we are dealing with the original PAC setting 
we mention that the realizability assumption holds. In Chapter 7 we will discuss 
other notions of learnability. 


3.4 BIBLIOGRAPHIC REMARKS 


Our most general definition of agnostic PAC learning with general loss functions 
follows the works of Vladimir Vapnik and Alexey Chervonenkis (Vapnik and 
Chervonenkis 1971). In particular, we follow Vapnik’s general setting of learning 
(Vapnik 1982, Vapnik 1992, Vapnik 1995, Vapnik 1998). 

The term PAC learning was introduced by Valiant (1984). Valiant was named 
the winner of the 2010 Turing Award for the introduction of the PAC model. 
Valiant’s definition requires that the sample complexity will be polynomial in $1/\epsilon$
and in $1/\delta$, as well as in the representation size of hypotheses in the class (see also
Kearns and Vazirani (1994)). As we will see in Chapter 6, if a problem is at all PAC
learnable then the sample complexity depends polynomially on $1/\epsilon$ and $\log(1/\delta)$.
Valiant’s definition also requires that the runtime of the learning algorithm will be 
polynomial in these quantities. In contrast, we chose to distinguish between the 
statistical aspect of learning and the computational aspect of learning. We will elab- 
orate on the computational aspect later on in Chapter 8. Finally, the formalization 
of agnostic PAC learning is due to Haussler (1992). 


3.5 EXERCISES 


3.1 Monotonicity of Sample Complexity: Let $\mathcal{H}$ be a hypothesis class for a binary
classification task. Suppose that $\mathcal{H}$ is PAC learnable and its sample complexity is given
by $m_{\mathcal{H}}(\cdot, \cdot)$. Show that $m_{\mathcal{H}}$ is monotonically nonincreasing in each of its parameters.
That is, show that given $\delta \in (0,1)$, and given $0 < \epsilon_1 \le \epsilon_2 < 1$, we have that
$m_{\mathcal{H}}(\epsilon_1, \delta) \ge m_{\mathcal{H}}(\epsilon_2, \delta)$. Similarly, show that given $\epsilon \in (0,1)$, and given $0 < \delta_1 \le \delta_2 < 1$,
we have that $m_{\mathcal{H}}(\epsilon, \delta_1) \ge m_{\mathcal{H}}(\epsilon, \delta_2)$.

3.2 Let $\mathcal{X}$ be a discrete domain, and let $\mathcal{H}_{\text{Singleton}} = \{h_z : z \in \mathcal{X}\} \cup \{h^-\}$, where for each
$z \in \mathcal{X}$, $h_z$ is the function defined by $h_z(x) = 1$ if $x = z$ and $h_z(x) = 0$ if $x \ne z$. $h^-$
is simply the all-negative hypothesis, namely, $\forall x \in \mathcal{X}$, $h^-(x) = 0$. The realizability
assumption here implies that the true hypothesis $f$ labels negatively all examples in
the domain, perhaps except one.

1. Describe an algorithm that implements the ERM rule for learning $\mathcal{H}_{\text{Singleton}}$ in
the realizable setup.

2. Show that $\mathcal{H}_{\text{Singleton}}$ is PAC learnable. Provide an upper bound on the sample
complexity.

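3.3 Let $\mathcal{X} = \mathbb{R}^2$, $\mathcal{Y} = \{0, 1\}$, and let $\mathcal{H}$ be the class of concentric circles in the plane, that
is, $\mathcal{H} = \{h_r : r \in \mathbb{R}_+\}$, where $h_r(x) = \mathbb{1}_{[\|x\| \le r]}$. Prove that $\mathcal{H}$ is PAC learnable (assume
realizability), and its sample complexity is bounded by

$$m_{\mathcal{H}}(\epsilon, \delta) \le \left\lceil \frac{\log(1/\delta)}{\epsilon} \right\rceil.$$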


3.4 In this question, we study the hypothesis class of Boolean conjunctions defined as
follows. The instance space is $\mathcal{X} = \{0, 1\}^d$ and the label set is $\mathcal{Y} = \{0, 1\}$. A literal over
the variables $x_1, \ldots, x_d$ is a simple Boolean function that takes the form $f(x) = x_i$, for
some $i \in [d]$, or $f(x) = 1 - x_i$ for some $i \in [d]$. We use the notation $\bar{x}_i$ as a shorthand
for $1 - x_i$. A conjunction is any product of literals. In Boolean logic, the product is
denoted using the $\wedge$ sign. For example, the function $h(x) = x_1 \cdot (1 - x_2)$ is written as
$x_1 \wedge \bar{x}_2$.

We consider the hypothesis class of all conjunctions of literals over the $d$ variables.
The empty conjunction is interpreted as the all-positive hypothesis (namely,
the function that returns $h(x) = 1$ for all $x$). The conjunction $x_1 \wedge \bar{x}_1$ (and similarly
any conjunction involving a literal and its negation) is allowed and interpreted as
the all-negative hypothesis (namely, the conjunction that returns $h(x) = 0$ for all $x$).
We assume realizability: Namely, we assume that there exists a Boolean conjunction
that generates the labels. Thus, each example $(x, y) \in \mathcal{X} \times \mathcal{Y}$ consists of an assignment
to the $d$ Boolean variables $x_1, \ldots, x_d$, and its truth value (0 for false and 1 for
true).

For instance, let $d = 3$ and suppose that the true conjunction is $x_1 \wedge \bar{x}_2$. Then, the
training set $S$ might contain the following instances:

((1, 1, 1), 0), ((1, 0, 1), 1), ((0, 1, 0), 0), ((1, 0, 0), 1).

Prove that the hypothesis class of all conjunctions over $d$ variables is PAC learnable
and bound its sample complexity. Propose an algorithm that implements the
ERM rule, whose runtime is polynomial in $d \cdot m$.
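
3.5 Let $\mathcal{X}$ be a domain and let $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_m$ be a sequence of distributions over $\mathcal{X}$. Let
$\mathcal{H}$ be a finite class of binary classifiers over $\mathcal{X}$ and let $f \in \mathcal{H}$. Suppose we are getting
a sample $S$ of $m$ examples, such that the instances are independent but are not identically
distributed; the $i$th instance is sampled from $\mathcal{D}_i$ and then $y_i$ is set to be $f(x_i)$.
Let $\bar{\mathcal{D}}_m$ denote the average, that is, $\bar{\mathcal{D}}_m = (\mathcal{D}_1 + \cdots + \mathcal{D}_m)/m$.

Fix an accuracy parameter $\epsilon \in (0, 1)$. Show that

$$\mathbb{P}\left[\exists h \in \mathcal{H} \text{ s.t. } L_{(\bar{\mathcal{D}}_m, f)}(h) > \epsilon \text{ and } L_{(S, f)}(h) = 0\right] \le |\mathcal{H}| e^{-\epsilon m}.$$

Hint: Use the geometric-arithmetic mean inequality.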


3.6 Let $\mathcal{H}$ be a hypothesis class of binary classifiers. Show that if $\mathcal{H}$ is agnostic PAC
learnable, then $\mathcal{H}$ is PAC learnable as well. Furthermore, if $A$ is a successful agnostic
PAC learner for $\mathcal{H}$, then $A$ is also a successful PAC learner for $\mathcal{H}$.

3.7 (*) The Bayes optimal predictor: Show that for every probability distribution $\mathcal{D}$, the
Bayes optimal predictor $f_{\mathcal{D}}$ is optimal, in the sense that for every classifier $g$ from
$\mathcal{X}$ to $\{0, 1\}$, $L_{\mathcal{D}}(f_{\mathcal{D}}) \le L_{\mathcal{D}}(g)$.

3.8 (*) We say that a learning algorithm $A$ is better than $B$ with respect to some
probability distribution, $\mathcal{D}$, if

$$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(B(S))$$

for all samples $S \in (\mathcal{X} \times \{0, 1\})^m$. We say that a learning algorithm $A$ is better than $B$,
if it is better than $B$ with respect to all probability distributions $\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$.

1. A probabilistic label predictor is a function that assigns to every domain point
$x$ a probability value, $h(x) \in [0, 1]$, that determines the probability of predicting
the label 1. That is, given such an $h$ and an input, $x$, the label for $x$ is predicted by
tossing a coin with bias $h(x)$ toward Heads and predicting 1 iff the coin comes up
Heads. Formally, we define a probabilistic label predictor as a function, $h : \mathcal{X} \to [0, 1]$.
The loss of such $h$ on an example $(x, y)$ is defined to be $|h(x) - y|$, which is
exactly the probability that the prediction of $h$ will not be equal to $y$. Note that
if $h$ is deterministic, that is, returns values in $\{0, 1\}$, then $|h(x) - y| = \mathbb{1}_{[h(x) \ne y]}$.
Prove that for every data-generating distribution $\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$, the Bayes
optimal predictor has the smallest risk (w.r.t. the loss function $\ell(h, (x, y)) = |h(x) - y|$,
among all possible label predictors, including probabilistic ones).

2. Let $\mathcal{X}$ be a domain and $\{0, 1\}$ be a set of labels. Prove that for every distribution
$\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$, there exists a learning algorithm $A_{\mathcal{D}}$ that is better than any
other learning algorithm with respect to $\mathcal{D}$.

3. Prove that for every learning algorithm $A$ there exist a probability distribution,
$\mathcal{D}$, and a learning algorithm $B$ such that $A$ is not better than $B$ w.r.t. $\mathcal{D}$.

3.9 Consider a variant of the PAC model in which there are two example oracles: one
that generates positive examples and one that generates negative examples, both
according to the underlying distribution $\mathcal{D}$ on $\mathcal{X}$. Formally, given a target function
$f : \mathcal{X} \to \{0, 1\}$, let $\mathcal{D}^+$ be the distribution over $\mathcal{X}^+ = \{x \in \mathcal{X} : f(x) = 1\}$ defined by
$\mathcal{D}^+(A) = \mathcal{D}(A)/\mathcal{D}(\mathcal{X}^+)$, for every $A \subset \mathcal{X}^+$. Similarly, $\mathcal{D}^-$ is the distribution over $\mathcal{X}^-$
induced by $\mathcal{D}$.

The definition of PAC learnability in the two-oracle model is the same as the
standard definition of PAC learnability except that here the learner has access to
$m^+_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples from $\mathcal{D}^+$ and $m^-_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples from $\mathcal{D}^-$. The learner’s
goal is to output $h$ s.t. with probability at least $1 - \delta$ (over the choice of the two
training sets, and possibly over the nondeterministic decisions made by the learning
algorithm), both $L_{(\mathcal{D}^+, f)}(h) \le \epsilon$ and $L_{(\mathcal{D}^-, f)}(h) \le \epsilon$.

1. (*) Show that if $\mathcal{H}$ is PAC learnable (in the standard one-oracle model), then $\mathcal{H}$
is PAC learnable in the two-oracle model.

2. (**) Define $h^+$ to be the always-plus hypothesis and $h^-$ to be the always-minus
hypothesis. Assume that $h^+, h^- \in \mathcal{H}$. Show that if $\mathcal{H}$ is PAC learnable in the
two-oracle model, then $\mathcal{H}$ is PAC learnable in the standard one-oracle model.

The first formal learning model that we have discussed was the PAC model. In
Chapter 2 we have shown that under the realizability assumption, any finite hypothesis
class is PAC learnable. In this chapter we will develop a general tool, uniform
convergence, and apply it to show that any finite class is learnable in the agnostic
PAC model with general loss functions, as long as the range of the loss function is
bounded.


4.1 UNIFORM CONVERGENCE IS SUFFICIENT FOR LEARNABILITY 


The idea behind the learning condition discussed in this chapter is very simple. 
Recall that, given a hypothesis class, $\mathcal{H}$, the ERM learning paradigm works as follows:
Upon receiving a training sample, $S$, the learner evaluates the risk (or error)
of each $h$ in $\mathcal{H}$ on the given sample and outputs a member of $\mathcal{H}$ that minimizes this
empirical risk. The hope is that an $h$ that minimizes the empirical risk with respect to
$S$ is a risk minimizer (or has risk close to the minimum) with respect to the true data
probability distribution as well. For that, it suffices to ensure that the empirical risks
of all members of H are good approximations of their true risk. Put another way, we 
need that uniformly over all hypotheses in the hypothesis class, the empirical risk 
will be close to the true risk, as formalized in the following. 


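Definition 4.1 ($\epsilon$-representative sample). A training set $S$ is called $\epsilon$-representative
(w.r.t. domain $Z$, hypothesis class $\mathcal{H}$, loss function $\ell$, and distribution $\mathcal{D}$) if

$$\forall h \in \mathcal{H}, \quad |L_S(h) - L_{\mathcal{D}}(h)| \le \epsilon.$$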
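
To make the definition concrete, here is a minimal Python sketch that checks whether a sample is $\epsilon$-representative for a finite hypothesis class, under the assumption that the distribution has finite support so that the true risk can be computed exactly. The helper names, the threshold class, and the toy distribution are all hypothetical choices for illustration.

```python
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[int, int]          # z = (x, y)
Hypothesis = Callable[[int], int]


def zero_one_loss(h: Hypothesis, z: Example) -> float:
    x, y = z
    return 0.0 if h(x) == y else 1.0


def true_risk(h: Hypothesis, dist: Dict[Example, float]) -> float:
    """L_D(h): expected loss under a finitely supported distribution."""
    return sum(p * zero_one_loss(h, z) for z, p in dist.items())


def empirical_risk(h: Hypothesis, sample: Sequence[Example]) -> float:
    """L_S(h): average loss over the sample."""
    return sum(zero_one_loss(h, z) for z in sample) / len(sample)


def is_representative(hypotheses, sample, dist, eps: float) -> bool:
    """True iff |L_S(h) - L_D(h)| <= eps for every h in the class."""
    return all(
        abs(empirical_risk(h, sample) - true_risk(h, dist)) <= eps
        for h in hypotheses
    )


def make_threshold(t: int) -> Hypothesis:
    return lambda x: 1 if x >= t else 0


# Toy setup (assumed): uniform distribution on {0,1,2,3}, label 1 iff x >= 2.
dist = {(x, 1 if x >= 2 else 0): 0.25 for x in range(4)}
hypotheses = [make_threshold(t) for t in range(5)]

balanced = [(0, 0), (1, 0), (2, 1), (3, 1)]  # mirrors the distribution
skewed = [(3, 1), (3, 1)]                    # sees only one point
```

In this toy setup the balanced sample reproduces every true risk exactly, while the skewed sample misjudges the all-positive hypothesis by 0.5, so it fails the check for small $\epsilon$.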


The next simple lemma states that whenever the sample is (€/2)-representative, 
the ERM learning rule is guaranteed to return a good hypothesis. 


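Lemma 4.2. Assume that a training set $S$ is $\frac{\epsilon}{2}$-representative (w.r.t. domain $Z$,
hypothesis class $\mathcal{H}$, loss function $\ell$, and distribution $\mathcal{D}$). Then, any output of
$\mathrm{ERM}_{\mathcal{H}}(S)$, namely, any $h_S \in \operatorname{argmin}_{h \in \mathcal{H}} L_S(h)$, satisfies

$$L_{\mathcal{D}}(h_S) \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon.$$

Proof. For every $h \in \mathcal{H}$,

$$L_{\mathcal{D}}(h_S) \le L_S(h_S) + \tfrac{\epsilon}{2} \le L_S(h) + \tfrac{\epsilon}{2} \le L_{\mathcal{D}}(h) + \tfrac{\epsilon}{2} + \tfrac{\epsilon}{2} = L_{\mathcal{D}}(h) + \epsilon,$$

where the first and third inequalities are due to the assumption that $S$ is
$\frac{\epsilon}{2}$-representative (Definition 4.1) and the second inequality holds since $h_S$ is an ERM
predictor. □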


The preceding lemma implies that to ensure that the ERM rule is an agnostic
PAC learner, it suffices to show that with probability of at least $1 - \delta$ over the random
choice of a training set, it will be an $\epsilon$-representative training set. The uniform
convergence condition formalizes this requirement.


Definition 4.3 (Uniform Convergence). We say that a hypothesis class $\mathcal{H}$ has the
uniform convergence property (w.r.t. a domain $Z$ and a loss function $\ell$) if there exists
a function $m^{UC}_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ such that for every $\epsilon, \delta \in (0, 1)$ and for every probability
distribution $\mathcal{D}$ over $Z$, if $S$ is a sample of $m \ge m^{UC}_{\mathcal{H}}(\epsilon, \delta)$ examples drawn i.i.d.
according to $\mathcal{D}$, then, with probability of at least $1 - \delta$, $S$ is $\epsilon$-representative.


Similar to the definition of sample complexity for PAC learning, the function
$m^{UC}_{\mathcal{H}}$ measures the (minimal) sample complexity of obtaining the uniform convergence
property, namely, how many examples we need to ensure that with
probability of at least $1 - \delta$ the sample would be $\epsilon$-representative.

The term uniform here refers to having a fixed sample size that works for all
members of $\mathcal{H}$ and over all possible probability distributions over the domain.

The following corollary follows directly from Lemma 4.2 and the definition of
uniform convergence.


Corollary 4.4. If a class $\mathcal{H}$ has the uniform convergence property with a function $m^{UC}_{\mathcal{H}}$
then the class is agnostically PAC learnable with the sample complexity $m_{\mathcal{H}}(\epsilon, \delta) \le
m^{UC}_{\mathcal{H}}(\epsilon/2, \delta)$. Furthermore, in that case, the $\mathrm{ERM}_{\mathcal{H}}$ paradigm is a successful agnostic
PAC learner for $\mathcal{H}$.


4.2 FINITE CLASSES ARE AGNOSTIC PAC LEARNABLE 


In view of Corollary 4.4, the claim that every finite hypothesis class is agnostic PAC 
learnable will follow once we establish that uniform convergence holds for a finite 
hypothesis class. 

To show that uniform convergence holds we follow a two step argument, similar 
to the derivation in Chapter 2. The first step applies the union bound while the 
second step employs a measure concentration inequality. We now explain these two 
steps in detail. 

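Fix some $\epsilon, \delta$. We need to find a sample size $m$ that guarantees that for any $\mathcal{D}$,
with probability of at least $1 - \delta$ of the choice of $S = (z_1, \ldots, z_m)$ sampled i.i.d. from
$\mathcal{D}$ we have that for all $h \in \mathcal{H}$, $|L_S(h) - L_{\mathcal{D}}(h)| \le \epsilon$. That is,

$$\mathcal{D}^m(\{S : \forall h \in \mathcal{H}, |L_S(h) - L_{\mathcal{D}}(h)| \le \epsilon\}) \ge 1 - \delta.$$

Equivalently, we need to show that

$$\mathcal{D}^m(\{S : \exists h \in \mathcal{H}, |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}) < \delta.$$

Writing

$$\{S : \exists h \in \mathcal{H}, |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\} = \bigcup_{h \in \mathcal{H}} \{S : |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\},$$

and applying the union bound (Lemma 2.2) we obtain

$$\mathcal{D}^m(\{S : \exists h \in \mathcal{H}, |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}) \le \sum_{h \in \mathcal{H}} \mathcal{D}^m(\{S : |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}). \tag{4.1}$$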
heH 

Our second step will be to argue that each summand of the right-hand side of
this inequality is small enough (for a sufficiently large $m$). That is, we will show that
for any fixed hypothesis, $h$, (which is chosen in advance prior to the sampling of the
training set), the gap between the true and empirical risks, $|L_S(h) - L_{\mathcal{D}}(h)|$, is likely
to be small.

Recall that $L_{\mathcal{D}}(h) = \mathbb{E}_{z\sim\mathcal{D}}[\ell(h, z)]$ and that $L_S(h) = \frac{1}{m}\sum_{i=1}^{m} \ell(h, z_i)$. Since each $z_i$
is sampled i.i.d. from $\mathcal{D}$, the expected value of the random variable $\ell(h, z_i)$ is $L_{\mathcal{D}}(h)$.
By the linearity of expectation, it follows that $L_{\mathcal{D}}(h)$ is also the expected value of
$L_S(h)$. Hence, the quantity $|L_{\mathcal{D}}(h) - L_S(h)|$ is the deviation of the random variable
$L_S(h)$ from its expectation. We therefore need to show that the measure of $L_S(h)$ is
concentrated around its expected value.

A basic statistical fact, the law of large numbers, states that when m goes to 
infinity, empirical averages converge to their true expectation. This is true for Ls(h), 
since it is the empirical average of m i.i.d. random variables. However, since the law
of large numbers is only an asymptotic result, it provides no information about the 
gap between the empirically estimated error and its true value for any given, finite, 
sample size. 

Instead, we will use a measure concentration inequality due to Hoeffding, which 
quantifies the gap between empirical averages and their expected value. 


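Lemma 4.5 (Hoeffding’s Inequality). Let $\theta_1, \ldots, \theta_m$ be a sequence of i.i.d. random
variables and assume that for all $i$, $\mathbb{E}[\theta_i] = \mu$ and $\mathbb{P}[a \le \theta_i \le b] = 1$. Then, for any
$\epsilon > 0$

$$\mathbb{P}\left[\left|\frac{1}{m}\sum_{i=1}^{m} \theta_i - \mu\right| > \epsilon\right] \le 2\exp\left(-2m\epsilon^2/(b - a)^2\right).$$

The proof can be found in Appendix B.

Getting back to our problem, let $\theta_i$ be the random variable $\ell(h, z_i)$. Since $h$ is
fixed and $z_1, \ldots, z_m$ are sampled i.i.d., it follows that $\theta_1, \ldots, \theta_m$ are also i.i.d. random
variables. Furthermore, $L_S(h) = \frac{1}{m}\sum_{i=1}^{m} \theta_i$ and $L_{\mathcal{D}}(h) = \mu$. Let us further assume
that the range of $\ell$ is $[0, 1]$ and therefore $\theta_i \in [0, 1]$. We therefore obtain that

$$\mathcal{D}^m(\{S : |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}) = \mathbb{P}\left[\left|\frac{1}{m}\sum_{i=1}^{m} \theta_i - \mu\right| > \epsilon\right] \le 2\exp\left(-2m\epsilon^2\right). \tag{4.2}$$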
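
As a sanity check on the bound for a single fixed hypothesis, the sketch below compares the exact deviation probability with the Hoeffding bound in the special case where the loss is 0-1, so that each $\theta_i$ is a Bernoulli variable with mean $\mu = p$. The Bernoulli special case and the function names are assumptions of this example; the lemma itself covers any variables bounded in $[a, b]$.

```python
import math


def exact_deviation_prob(m: int, p: float, eps: float) -> float:
    """P[|mean - p| > eps] for the mean of m i.i.d. Bernoulli(p) variables,
    computed by summing the binomial probabilities of the deviating outcomes."""
    total = 0.0
    for k in range(m + 1):
        if abs(k / m - p) > eps:
            total += math.comb(m, k) * p**k * (1 - p) ** (m - k)
    return total


def hoeffding_bound(m: int, eps: float, a: float = 0.0, b: float = 1.0) -> float:
    """Right-hand side of Hoeffding's inequality for variables in [a, b]."""
    return 2 * math.exp(-2 * m * eps**2 / (b - a) ** 2)


# Compare the two quantities on a small grid of (m, p, eps) values.
comparison = [
    (m, p, eps, exact_deviation_prob(m, p, eps), hoeffding_bound(m, eps))
    for m in (10, 50, 200)
    for p in (0.1, 0.5)
    for eps in (0.05, 0.1, 0.2)
]
```

For small $m$ and $\epsilon$ the bound exceeds 1 and is vacuous; it becomes informative once $m\epsilon^2$ is large, which is what drives the sample complexity derived next.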


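Combining this with Equation (4.1) yields

$$\mathcal{D}^m(\{S : \exists h \in \mathcal{H}, |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}) \le \sum_{h \in \mathcal{H}} 2\exp\left(-2m\epsilon^2\right) = 2|\mathcal{H}|\exp\left(-2m\epsilon^2\right).$$

Finally, if we choose

$$m \ge \frac{\log(2|\mathcal{H}|/\delta)}{2\epsilon^2}$$

then

$$\mathcal{D}^m(\{S : \exists h \in \mathcal{H}, |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}) \le \delta.$$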


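Corollary 4.6. Let $\mathcal{H}$ be a finite hypothesis class, let $Z$ be a domain, and let $\ell : \mathcal{H} \times
Z \to [0, 1]$ be a loss function. Then, $\mathcal{H}$ enjoys the uniform convergence property with
sample complexity

$$m^{UC}_{\mathcal{H}}(\epsilon, \delta) \le \left\lceil \frac{\log(2|\mathcal{H}|/\delta)}{2\epsilon^2} \right\rceil.$$

Furthermore, the class is agnostically PAC learnable using the ERM algorithm with
sample complexity

$$m_{\mathcal{H}}(\epsilon, \delta) \le m^{UC}_{\mathcal{H}}(\epsilon/2, \delta) \le \left\lceil \frac{2\log(2|\mathcal{H}|/\delta)}{\epsilon^2} \right\rceil.$$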
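
The two bounds of Corollary 4.6 are easy to evaluate numerically. The following sketch does so in Python; the function names and the example values ($|\mathcal{H}| = 1000$, $\epsilon = 0.1$, $\delta = 0.05$) are illustrative assumptions, and logarithms are natural, as in Hoeffding's inequality.

```python
import math


def uc_sample_complexity(h_size: int, eps: float, delta: float) -> int:
    """Upper bound on m^UC_H(eps, delta): ceil(log(2|H|/delta) / (2 eps^2))."""
    return math.ceil(math.log(2 * h_size / delta) / (2 * eps**2))


def agnostic_pac_sample_complexity(h_size: int, eps: float, delta: float) -> int:
    """Upper bound on m_H(eps, delta): ceil(2 log(2|H|/delta) / eps^2)."""
    return math.ceil(2 * math.log(2 * h_size / delta) / eps**2)


def failure_probability_bound(h_size: int, m: int, eps: float) -> float:
    """Union bound plus Hoeffding: 2|H| exp(-2 m eps^2)."""
    return 2 * h_size * math.exp(-2 * m * eps**2)


# Hypothetical example values.
m_uc = uc_sample_complexity(1000, 0.1, 0.05)
m_pac = agnostic_pac_sample_complexity(1000, 0.1, 0.05)
```

Halving $\epsilon$ quadruples the required sample size, whereas the dependence on $|\mathcal{H}|$ and $1/\delta$ is only logarithmic.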


Remark 4.1 (The “Discretization Trick”). While the preceding corollary only 
applies to finite hypothesis classes, there is a simple trick that allows us to get a 
very good estimate of the practical sample complexity of infinite hypothesis classes. 
Consider a hypothesis class that is parameterized by d parameters. For example, 
let $\mathcal{X} = \mathbb{R}$, $\mathcal{Y} = \{\pm 1\}$, and the hypothesis class, $\mathcal{H}$, be all functions of the form
$h_\theta(x) = \mathrm{sign}(x - \theta)$. That is, each hypothesis is parameterized by one parameter,
$\theta \in \mathbb{R}$, and the hypothesis outputs 1 for all instances larger than $\theta$ and outputs $-1$
for instances smaller than $\theta$. This is a hypothesis class of an infinite size. However,
if we are going to learn this hypothesis class in practice, using a computer, we will
probably maintain real numbers using floating point representation, say, of 64 bits.
It follows that in practice, our hypothesis class is parameterized by the set of scalars
that can be represented using a 64-bit floating point number. There are at most $2^{64}$
such numbers; hence the actual size of our hypothesis class is at most $2^{64}$. More generally,
if our hypothesis class is parameterized by $d$ numbers, in practice we learn
a hypothesis class of size at most $2^{64d}$. Applying Corollary 4.6 we obtain that the
sample complexity of such classes is bounded by $\frac{128d + 2\log(2/\delta)}{\epsilon^2}$. This upper bound
on the sample complexity has the deficiency of being dependent on the specific rep- 
resentation of real numbers used by our machine. In Chapter 6 we will introduce 
a rigorous way to analyze the sample complexity of infinite size hypothesis classes. 
Nevertheless, the discretization trick can be used to get a rough estimate of the 
sample complexity in many practical situations. 
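
The arithmetic behind the discretization estimate can be sketched as follows. The function name and the `bits` parameter are assumptions made for illustration. The code evaluates the bound of Corollary 4.6 with $\log|\mathcal{H}| \le 64 d \log 2$, and also the looser closed form quoted above, which replaces $\log 2$ by 1.

```python
import math


def discretized_sample_complexity(d: int, eps: float, delta: float, bits: int = 64):
    """Return (exact, loose) sample-size estimates for a class with d parameters,
    each stored in `bits` bits, so that |H| <= 2**(bits * d).

    exact: ceil(2 * (bits*d*log 2 + log(2/delta)) / eps^2)   (Corollary 4.6)
    loose: ceil((2*bits*d + 2*log(2/delta)) / eps^2)         (log 2 replaced by 1)
    """
    log_h = bits * d * math.log(2)  # log of the discretized class size
    exact = 2 * (log_h + math.log(2 / delta)) / eps**2
    loose = (2 * bits * d + 2 * math.log(2 / delta)) / eps**2
    return math.ceil(exact), math.ceil(loose)


# Hypothetical example: the one-parameter threshold class, eps = 0.1, delta = 0.05.
exact_m, loose_m = discretized_sample_complexity(d=1, eps=0.1, delta=0.05)
```

Switching to 32-bit floats would roughly halve the estimate, which illustrates the dependence on the machine representation that the remark points out.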


4.3 SUMMARY 


If the uniform convergence property holds for a hypothesis class H then in most 
cases the empirical risks of hypotheses in H will faithfully represent their true 
risks. Uniform convergence suffices for agnostic PAC learnability using the ERM 
rule. We have shown that finite hypothesis classes enjoy the uniform convergence 
property and are hence agnostic PAC learnable. 

4.4 BIBLIOGRAPHIC REMARKS


Classes of functions for which the uniform convergence property holds are also 
called Glivenko-Cantelli classes, named after Valery Ivanovich Glivenko and 
Francesco Paolo Cantelli, who proved the first uniform convergence result in the 
1930s. See (Dudley, Gine & Zinn 1991). The relation between uniform convergence 
and learnability was thoroughly studied by Vapnik — see (Vapnik 1992, Vapnik 1995, 
Vapnik 1998). In fact, as we will see later in Chapter 6, the fundamental theorem of 
learning theory states that in binary classification problems, uniform convergence is 
not only a sufficient condition for learnability but is also a necessary condition. This 
is not the case for more general learning problems (see (Shalev-Shwartz, Shamir, 
Srebro & Sridharan 2010)). 


4.5 EXERCISES 


4.1 In this exercise, we show that the $(\epsilon, \delta)$ requirement on the convergence of errors in
our definitions of PAC learning, is, in fact, quite close to a simpler looking requirement
about averages (or expectations). Prove that the following two statements are
equivalent (for any learning algorithm $A$, any probability distribution $\mathcal{D}$, and any
loss function whose range is $[0, 1]$):

1. For every $\epsilon, \delta > 0$, there exists $m(\epsilon, \delta)$ such that $\forall m \ge m(\epsilon, \delta)$

$$\mathop{\mathbb{P}}_{S\sim\mathcal{D}^m}\left[L_{\mathcal{D}}(A(S)) > \epsilon\right] < \delta$$

2.

$$\lim_{m\to\infty} \mathop{\mathbb{E}}_{S\sim\mathcal{D}^m}\left[L_{\mathcal{D}}(A(S))\right] = 0$$

(where $\mathbb{E}_{S\sim\mathcal{D}^m}$ denotes the expectation over samples $S$ of size $m$).
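4.2 Bounded loss functions: In Corollary 4.6 we assumed that the range of the loss function
is $[0, 1]$. Prove that if the range of the loss function is $[a, b]$ then the sample
complexity satisfies

$$m_{\mathcal{H}}(\epsilon, \delta) \le m^{UC}_{\mathcal{H}}(\epsilon/2, \delta) \le \left\lceil \frac{2\log(2|\mathcal{H}|/\delta)(b - a)^2}{\epsilon^2} \right\rceil.$$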

In Chapter 2 we saw that unless one is careful, the training data can mislead the
learner, and result in overfitting. To overcome this problem, we restricted the search 
space to some hypothesis class H. Such a hypothesis class can be viewed as reflecting 
some prior knowledge that the learner has about the task — a belief that one of 
the members of the class H is a low-error model for the task. For example, in our 
papayas taste problem, on the basis of our previous experience with other fruits, 
we may assume that some rectangle in the color-hardness plane predicts (at least 
approximately) the papaya’s tastiness. 

Is such prior knowledge really necessary for the success of learning? Maybe 
there exists some kind of universal learner, that is, a learner who has no prior knowl- 
edge about a certain task and is ready to be challenged by any task? Let us elaborate 
on this point. A specific learning task is defined by an unknown distribution $\mathcal{D}$ over
$\mathcal{X} \times \mathcal{Y}$, where the goal of the learner is to find a predictor $h : \mathcal{X} \to \mathcal{Y}$, whose risk,
$L_{\mathcal{D}}(h)$, is small enough. The question is therefore whether there exist a learning
algorithm $A$ and a training set size $m$, such that for every distribution $\mathcal{D}$, if $A$ receives
$m$ i.i.d. examples from $\mathcal{D}$, there is a high chance it outputs a predictor $h$ that has a
low risk.

The first part of this chapter addresses this question formally. The No-Free- 
Lunch theorem states that no such universal learner exists. To be more precise, the 
theorem states that for binary classification prediction tasks, for every learner there 
exists a distribution on which it fails. We say that the learner fails if, upon receiving 
i.i.d. examples from that distribution, its output hypothesis is likely to have a large
risk, say, > 0.3, whereas for the same distribution, there exists another learner that 
will output a hypothesis with a small risk. In other words, the theorem states that no 
learner can succeed on all learnable tasks — every learner has tasks on which it fails 
while other learners succeed. 

Therefore, when approaching a particular learning problem, defined by some 
distribution D, we should have some prior knowledge on D. One type of such prior 
knowledge is that D comes from some specific parametric family of distributions. 
We will study learning under such assumptions later on in Chapter 24. Another type 
of prior knowledge on D, which we assumed when defining the PAC learning model, 
is that there exists $h$ in some predefined hypothesis class $\mathcal{H}$, such that $L_{\mathcal{D}}(h) = 0$. A
softer type of prior knowledge on $\mathcal{D}$ is assuming that $\min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)$ is small. In a
sense, this weaker assumption on $\mathcal{D}$ is a prerequisite for using the agnostic PAC
model, in which we require that the risk of the output hypothesis will not be much
larger than $\min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)$.

In the second part of this chapter we study the benefits and pitfalls of using a 
hypothesis class as a means of formalizing prior knowledge. We decompose the 
error of an ERM algorithm over a class $\mathcal{H}$ into two components. The first component
reflects the quality of our prior knowledge, measured by the minimal risk of a
hypothesis in our hypothesis class, $\min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)$. This component is also called the
approximation error, or the bias of the algorithm toward choosing a hypothesis from
$\mathcal{H}$. The second component is the error due to overfitting, which depends on the size
or the complexity of the class $\mathcal{H}$ and is called the estimation error. These two terms
imply a tradeoff between choosing a more complex $\mathcal{H}$ (which can decrease the bias
but increases the risk of overfitting) or a less complex $\mathcal{H}$ (which might increase the
bias but decreases the potential overfitting).


5.1 THE NO-FREE-LUNCH THEOREM 


In this part we prove that there is no universal learner. We do this by showing that 
no learner can succeed on all learning tasks, as formalized in the following theorem: 


Theorem 5.1. (No-Free-Lunch) Let $A$ be any learning algorithm for the task of
binary classification with respect to the 0-1 loss over a domain $\mathcal{X}$. Let $m$ be any number
smaller than $|\mathcal{X}|/2$, representing a training set size. Then, there exists a distribution
$\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$ such that:


1. There exists a function $f : \mathcal{X} \to \{0, 1\}$ with $L_{\mathcal{D}}(f) = 0$.
2. With probability of at least 1/7 over the choice of $S \sim \mathcal{D}^m$ we have that
$L_{\mathcal{D}}(A(S)) \ge 1/8$.


This theorem states that for every learner, there exists a task on which it fails, 
even though that task can be successfully learned by another learner. Indeed, a 
trivial successful learner in this case would be an ERM learner with the hypothesis
class $\mathcal{H} = \{f\}$, or more generally, ERM with respect to any finite hypothesis
class that contains $f$ and whose size satisfies the inequality $m \ge 8\log(7|\mathcal{H}|/6)$ (see
Corollary 2.3).


Proof. Let $C$ be a subset of $\mathcal{X}$ of size $2m$. The intuition of the proof is that any
learning algorithm that observes only half of the instances in $C$ has no information
on what should be the labels of the rest of the instances in $C$. Therefore, there exists
a “reality,” that is, some target function $f$, that would contradict the labels that $A(S)$
predicts on the unobserved instances in $C$.

Note that there are $T = 2^{2m}$ possible functions from $C$ to $\{0, 1\}$. Denote these
functions by $f_1, \ldots, f_T$. For each such function, let $\mathcal{D}_i$ be a distribution over $C \times \{0, 1\}$
defined by

$$\mathcal{D}_i(\{(x, y)\}) = \begin{cases} 1/|C| & \text{if } y = f_i(x) \\ 0 & \text{otherwise.} \end{cases}$$
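$\{0, 1\}$ and every $i$ we have

$$L_{\mathcal{D}_i}(h) = \frac{1}{2m}\sum_{x \in C} \mathbb{1}_{[h(x) \ne f_i(x)]} \ge \frac{1}{2m}\sum_{r=1}^{p} \mathbb{1}_{[h(v_r) \ne f_i(v_r)]} \ge \frac{1}{2p}\sum_{r=1}^{p} \mathbb{1}_{[h(v_r) \ne f_i(v_r)]}. \tag{5.5}$$

Hence,

$$\frac{1}{T}\sum_{i=1}^{T} L_{\mathcal{D}_i}(A(S^i_j)) \ge \frac{1}{T}\sum_{i=1}^{T} \frac{1}{2p}\sum_{r=1}^{p} \mathbb{1}_{[A(S^i_j)(v_r) \ne f_i(v_r)]} = \frac{1}{2p}\sum_{r=1}^{p} \frac{1}{T}\sum_{i=1}^{T} \mathbb{1}_{[A(S^i_j)(v_r) \ne f_i(v_r)]} \ge \frac{1}{2} \cdot \min_{r \in [p]} \frac{1}{T}\sum_{i=1}^{T} \mathbb{1}_{[A(S^i_j)(v_r) \ne f_i(v_r)]}. \tag{5.6}$$

Next, fix some $r \in [p]$. We can partition all the functions in $f_1, \ldots, f_T$ into $T/2$ disjoint
pairs, where for a pair $(f_i, f_{i'})$ we have that for every $c \in C$, $f_i(c) \ne f_{i'}(c)$ if and
only if $c = v_r$. Since for such a pair we must have $S^i_j = S^{i'}_j$, it follows that

$$\mathbb{1}_{[A(S^i_j)(v_r) \ne f_i(v_r)]} + \mathbb{1}_{[A(S^{i'}_j)(v_r) \ne f_{i'}(v_r)]} = 1,$$

which yields

$$\frac{1}{T}\sum_{i=1}^{T} \mathbb{1}_{[A(S^i_j)(v_r) \ne f_i(v_r)]} = \frac{1}{2}.$$

Combining this with Equation (5.6), Equation (5.4), and Equation (5.3), we obtain
that Equation (5.1) holds, which concludes our proof. □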
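
The averaging idea behind the proof can be checked by brute force on a tiny domain. The sketch below takes $|C| = 2m$ with $m = 2$, enumerates every labeling function $f : C \to \{0, 1\}$ and every sequence of $m$ instances, and averages the true risk of a learner's output. The `memorizing_learner` is a hypothetical learner chosen for illustration (it repeats the labels it saw and predicts 0 elsewhere); the theorem applies to any learner.

```python
from itertools import product


def memorizing_learner(sample):
    """Hypothetical learner: repeat labels seen in the sample, predict 0 elsewhere."""
    seen = dict(sample)
    return lambda x: seen.get(x, 0)


def average_risk(learner, m: int) -> float:
    """Average, over all f: C -> {0,1} and all instance sequences in C^m,
    of the 0-1 risk of learner(S) under the uniform distribution on C."""
    domain = range(2 * m)  # the set C, with |C| = 2m
    total, count = 0.0, 0
    for labels in product((0, 1), repeat=2 * m):   # every labeling function f
        for xs in product(domain, repeat=m):       # every instance sequence
            sample = [(x, labels[x]) for x in xs]  # S labeled by f
            h = learner(sample)
            # True risk of h: fraction of C on which h disagrees with f.
            total += sum(h(x) != labels[x] for x in domain) / (2 * m)
            count += 1
    return total / count


avg = average_risk(memorizing_learner, m=2)
```

Since the average risk over all labelings is at least 1/4, at least one labeling $f_i$ must give the learner an expected risk of at least 1/4, which is the quantity the proof converts into the probability statement of the theorem.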


5.1.1 No-Free-Lunch and Prior Knowledge 


How does the No-Free-Lunch result relate to the need for prior knowledge? Let us 
consider an ERM predictor over the hypothesis class H of all the functions f from 
X to {0,1}. This class represents lack of prior knowledge: Every possible function 
from the domain to the label set is considered a good candidate. According to the 
No-Free-Lunch theorem, any algorithm that chooses its output from hypotheses in 
H, and in particular the ERM predictor, will fail on some learning task. Therefore, 
this class is not PAC learnable, as formalized in the following corollary: 


Corollary 5.2. Let $\mathcal{X}$ be an infinite domain set and let $\mathcal{H}$ be the set of all functions
from $\mathcal{X}$ to $\{0, 1\}$. Then, $\mathcal{H}$ is not PAC learnable.


Proof. Assume, by way of contradiction, that the class is learnable. Choose some
ε < 1/8 and δ < 1/7. By the definition of PAC learnability, there must be some
learning algorithm A and an integer m = m(ε, δ), such that for any data-generating
distribution over X × {0, 1}, if for some function f : X → {0,1}, $L_{\mathcal{D}}(f) = 0$, then with
probability greater than 1 − δ when A is applied to samples S of size m, generated
i.i.d. by D, $L_{\mathcal{D}}(A(S)) \le \epsilon$. However, applying the No-Free-Lunch theorem, since
|X| > 2m, for every learning algorithm (and in particular for the algorithm A), there
exists a distribution D such that with probability greater than 1/7 > δ, $L_{\mathcal{D}}(A(S)) >$
1/8 > ε, which leads to the desired contradiction. □


How can we prevent such failures? We can escape the hazards foreseen by the 
No-Free-Lunch theorem by using our prior knowledge about a specific learning 
task, to avoid the distributions that will cause us to fail when learning that task. 
Such prior knowledge can be expressed by restricting our hypothesis class. 

But how should we choose a good hypothesis class? On the one hand, we want 
to believe that this class includes the hypothesis that has no error at all (in the PAC 
setting), or at least that the smallest error achievable by a hypothesis from this class 
is indeed rather small (in the agnostic setting). On the other hand, we have just seen 
that we cannot simply choose the richest class — the class of all functions over the 
given domain. This tradeoff is discussed in the following section. 


5.2 ERROR DECOMPOSITION 


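To answer this question we decompose the error of an $\mathrm{ERM}_{\mathcal{H}}$ predictor into two
components as follows. Let $h_S$ be an $\mathrm{ERM}_{\mathcal{H}}$ hypothesis. Then, we can write

$$L_{\mathcal{D}}(h_S) = \epsilon_{\mathrm{app}} + \epsilon_{\mathrm{est}} \quad \text{where:} \quad \epsilon_{\mathrm{app}} = \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h), \quad \epsilon_{\mathrm{est}} = L_{\mathcal{D}}(h_S) - \epsilon_{\mathrm{app}}. \tag{5.7}$$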


■ The Approximation Error: the minimum risk achievable by a predictor in
  the hypothesis class. This term measures how much risk we have because we
  restrict ourselves to a specific class, namely, how much inductive bias we have.
  The approximation error does not depend on the sample size and is determined
  by the hypothesis class chosen. Enlarging the hypothesis class can decrease the
  approximation error.
  Under the realizability assumption, the approximation error is zero. In the
  agnostic case, however, the approximation error can be large.¹
■ The Estimation Error: the difference between the approximation error and the
  error achieved by the ERM predictor. The estimation error results because the
  empirical risk (i.e., training error) is only an estimate of the true risk, and so
  the predictor minimizing the empirical risk is only an estimate of the predictor
  minimizing the true risk.
  The quality of this estimation depends on the training set size and on the size,
  or complexity, of the hypothesis class. As we have shown, for a finite hypothesis
  class, $\epsilon_{\mathrm{est}}$ increases (logarithmically) with $|\mathcal{H}|$ and decreases with m. We can
  think of the size of H as a measure of its complexity. In future chapters we will
  define other complexity measures of hypothesis classes.

¹ In fact, it always includes the error of the Bayes optimal predictor (see Chapter 3), the minimal yet
inevitable error, because of the possible nondeterminism of the world in this model. Sometimes in the
literature the term approximation error refers not to $\min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)$, but rather to the excess error over
that of the Bayes optimal predictor, namely, $\min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) - \epsilon_{\mathrm{Bayes}}$.
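
The decomposition in Equation (5.7) can be computed exactly when the distribution is known. The following is a minimal sketch on a toy problem of our own choosing: x is uniform on {0, ..., 9}, the clean label is 1 when x < 5, and each label is flipped with probability 0.1. The hypothesis class is a finite set of thresholds. All names and numbers here are illustrative assumptions.

```python
import random


def true_risk(h, dist):
    """L_D(h) for a finite distribution given as {(x, y): probability}."""
    return sum(p for (x, y), p in dist.items() if h(x) != y)


def empirical_risk(h, sample):
    """L_S(h): the fraction of sample points that h mislabels."""
    return sum(h(x) != y for x, y in sample) / len(sample)


def decompose(hypotheses, dist, sample):
    """Return (eps_app, eps_est) for an ERM hypothesis chosen on the sample."""
    eps_app = min(true_risk(h, dist) for h in hypotheses)
    h_s = min(hypotheses, key=lambda h: empirical_risk(h, sample))  # an ERM hypothesis
    eps_est = true_risk(h_s, dist) - eps_app
    return eps_app, eps_est


# Toy distribution (an assumption): noisy threshold labels on {0, ..., 9}.
dist = {}
for x in range(10):
    clean = int(x < 5)
    dist[(x, clean)] = 0.1 * 0.9
    dist[(x, 1 - clean)] = 0.1 * 0.1

# Finite hypothesis class: h_a(x) = 1 if x < a, for a in {0, ..., 10}.
thresholds = [lambda x, a=a: int(x < a) for a in range(11)]

rng = random.Random(0)
pairs, weights = zip(*dist.items())
sample = rng.choices(pairs, weights=weights, k=20)
eps_app, eps_est = decompose(thresholds, dist, sample)
print(eps_app, eps_est)
```

Here the approximation error equals the noise rate, because the best threshold in the class is also the Bayes optimal predictor. The estimation error depends on the drawn sample and should tend to shrink as the sample size k grows.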


Since our goal is to minimize the total risk, we face a tradeoff, called the bias- 
complexity tradeoff. On one hand, choosing H to be a very rich class decreases the 
approximation error but at the same time might increase the estimation error, as a 
rich H might lead to overfitting. On the other hand, choosing H to be a very small 
set reduces the estimation error but might increase the approximation error or, in 
other words, might lead to underfitting. Of course, a great choice for H is the class
that contains only one classifier, the Bayes optimal classifier. But the Bayes optimal
classifier depends on the underlying distribution D, which we do not know (indeed, 
learning would have been unnecessary had we known D). 

Learning theory studies how rich we can make H while still maintaining reason- 
able estimation error. In many cases, empirical research focuses on designing good 
hypothesis classes for a certain domain. Here, “good” means classes for which the 
approximation error would not be excessively high. The idea is that although we are 
not experts and do not know how to construct the optimal classifier, we still have 
some prior knowledge of the specific problem at hand, which enables us to design 
hypothesis classes for which both the approximation error and the estimation error 
are not too large. Getting back to our papayas example, we do not know how exactly 
the color and hardness of a papaya predict its taste, but we do know that papaya is
a fruit and on the basis of previous experience with other fruit we conjecture that a
rectangle in the color-hardness space may be a good predictor.


5.3 SUMMARY


The No-Free-Lunch theorem states that there is no universal learner. Every learner 
has to be specified to some task, and use some prior knowledge about that task, in 
order to succeed. So far we have modeled our prior knowledge by restricting our 
output hypothesis to be a member of a chosen hypothesis class. When choosing 
this hypothesis class, we face a tradeoff, between a larger, or more complex, class 
that is more likely to have a small approximation error, and a more restricted class 
that would guarantee that the estimation error will be small. In the next chapter we 
will study in more detail the behavior of the estimation error. In Chapter 7 we will 
discuss alternative ways to express prior knowledge. 


5.4 BIBLIOGRAPHIC REMARKS 


(Wolpert & Macready 1997) proved several no-free-lunch theorems for optimiza- 
tion, but these are rather different from the theorem we prove here. The theorem 
we prove here is closely related to lower bounds in VC theory, as we will study in 
the next chapter. 


5.5 EXERCISES 


5.1 Prove that Equation (5.2) suffices for showing that $P[L_{\mathcal{D}}(A(S)) \ge 1/8] \ge 1/7$.
Hint: Let θ be a random variable that receives values in [0, 1] and whose expectation
satisfies E[θ] ≥ 1/4. Use Lemma B.1 to show that P[θ ≥ 1/8] ≥ 1/7.

5.2 Assume you are asked to design a learning algorithm to predict whether patients
are going to suffer a heart attack. Relevant patient features the algorithm may have
access to include blood pressure (BP), body-mass index (BMI), age (A), level of
physical activity (P), and income (I).

You have to choose between two algorithms; the first picks an axis aligned rectangle
in the two dimensional space spanned by the features BP and BMI and the
other picks an axis aligned rectangle in the five dimensional space spanned by all
the preceding features.

1. Explain the pros and cons of each choice.
2. Explain how the number of available labeled training samples will affect your
choice.

5.3 Prove that if |X| ≥ km for a positive integer k ≥ 2, then we can replace the lower
bound of 1/4 in the No-Free-Lunch theorem with $\frac{k-1}{2k} = \frac{1}{2} - \frac{1}{2k}$. Namely, let A be a
learning algorithm for the task of binary classification. Let m be any number smaller
than |X|/k, representing a training set size. Then, there exists a distribution D over
X × {0,1} such that:
■ There exists a function f : X → {0,1} with $L_{\mathcal{D}}(f) = 0$.
■ $\mathbb{E}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(A(S))] \ge \frac{1}{2} - \frac{1}{2k}$.

In the previous chapter, we decomposed the error of the ERM rule into approximation
error and estimation error. The approximation error depends on the fit
of our prior knowledge (as reflected by the choice of the hypothesis class H) to 
the underlying unknown distribution. In contrast, the definition of PAC learn- 
ability requires that the estimation error would be bounded uniformly over all 
distributions. 

Our current goal is to figure out which classes H are PAC learnable, and to 
characterize exactly the sample complexity of learning a given hypothesis class. So 
far we have seen that finite classes are learnable, but that the class of all functions 
(over an infinite size domain) is not. What makes one class learnable and the other 
unlearnable? Can infinite-size classes be learnable, and, if so, what determines their 
sample complexity? 

We begin the chapter by showing that infinite classes can indeed be learn- 
able, and thus, finiteness of the hypothesis class is not a necessary condition for 
learnability. We then present a remarkably crisp characterization of the family of 
learnable classes in the setup of binary valued classification with the zero-one loss. 
This characterization was first discovered by Vladimir Vapnik and Alexey Chervo- 
nenkis in 1970 and relies on a combinatorial notion called the Vapnik-Chervonenkis 
dimension (VC-dimension). We formally define the VC-dimension, provide several 
examples, and then state the fundamental theorem of statistical learning theory, 
which integrates the concepts of learnability, VC-dimension, the ERM rule, and 
uniform convergence. 


6.1 INFINITE-SIZE CLASSES CAN BE LEARNABLE 


In Chapter 4 we saw that finite classes are learnable, and in fact the sample complex- 
ity of a hypothesis class is upper bounded by the log of its size. To show that the size 
of the hypothesis class is not the right characterization of its sample complexity, we 
first present a simple example of an infinite-size hypothesis class that is learnable.

Example 6.1. Let H be the set of threshold functions over the real line, namely,
$\mathcal{H} = \{h_a : a \in \mathbb{R}\}$, where $h_a : \mathbb{R} \to \{0,1\}$ is a function such that $h_a(x) = \mathbb{1}_{[x<a]}$. To
remind the reader, $\mathbb{1}_{[x<a]}$ is 1 if x < a and 0 otherwise. Clearly, H is of infinite size.
Nevertheless, the following lemma shows that H is learnable in the PAC model
using the ERM algorithm.


Lemma 6.1. Let H be the class of thresholds as defined earlier. Then, H is PAC
learnable, using the ERM rule, with sample complexity of $m_{\mathcal{H}}(\epsilon,\delta) \le \lceil \log(2/\delta)/\epsilon \rceil$.


Proof. Let $a^*$ be a threshold such that the hypothesis $h^*(x) = \mathbb{1}_{[x<a^*]}$ achieves
$L_{\mathcal{D}}(h^*) = 0$. Let $\mathcal{D}_x$ be the marginal distribution over the domain X and let $a_0 < a^* < a_1$
be such that

$$\mathbb{P}_{x \sim \mathcal{D}_x}[x \in (a_0, a^*)] = \mathbb{P}_{x \sim \mathcal{D}_x}[x \in (a^*, a_1)] = \epsilon.$$

[Figure: the real line with the points $a_0 < a^* < a_1$ marked; each of the intervals $(a_0, a^*)$ and $(a^*, a_1)$ carries probability mass ε.]

(If $\mathcal{D}_x(-\infty, a^*) \le \epsilon$ we set $a_0 = -\infty$ and similarly for $a_1$). Given a training set S,
let $b_0 = \max\{x : (x, 1) \in S\}$ and $b_1 = \min\{x : (x, 0) \in S\}$ (if no example in S is positive
we set $b_0 = -\infty$ and if no example in S is negative we set $b_1 = \infty$). Let $b_S$ be a
threshold corresponding to an ERM hypothesis, $h_S$, which implies that $b_S \in (b_0, b_1)$.
Therefore, a sufficient condition for $L_{\mathcal{D}}(h_S) \le \epsilon$ is that both $b_0 \ge a_0$ and $b_1 \le a_1$. In
other words,

$$\mathbb{P}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(h_S) > \epsilon] \le \mathbb{P}_{S \sim \mathcal{D}^m}[b_0 < a_0 \lor b_1 > a_1],$$

and using the union bound we can bound the preceding by

$$\mathbb{P}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(h_S) > \epsilon] \le \mathbb{P}_{S \sim \mathcal{D}^m}[b_0 < a_0] + \mathbb{P}_{S \sim \mathcal{D}^m}[b_1 > a_1]. \tag{6.1}$$

The event $b_0 < a_0$ happens if and only if all examples in S are not in the interval
$(a_0, a^*)$, whose probability mass is defined to be ε, namely,

$$\mathbb{P}_{S \sim \mathcal{D}^m}[b_0 < a_0] = \mathbb{P}_{S \sim \mathcal{D}^m}[\forall (x,y) \in S,\; x \notin (a_0, a^*)] = (1-\epsilon)^m \le e^{-\epsilon m}.$$

Since we assume m > log(2/δ)/ε it follows that the equation is at most δ/2. In the
same way it is easy to see that $\mathbb{P}_{S \sim \mathcal{D}^m}[b_1 > a_1] \le \delta/2$. Combining with Equation (6.1)
we conclude our proof. □
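
The guarantee of Lemma 6.1 can be illustrated with a small simulation. This is a sketch under assumptions of our own: instances are uniform on [0, 1], the true threshold is 0.5, and the ERM rule returns the smallest negatively labeled point, which is one of many thresholds with zero empirical risk. The function names are hypothetical.

```python
import math
import random


def sample_complexity(eps, delta):
    """The bound of Lemma 6.1: ceil(log(2/delta) / eps)."""
    return math.ceil(math.log(2 / delta) / eps)


def erm_threshold(sample):
    """Return b_1, the smallest negatively labeled point (inf if there is none).

    This threshold has zero empirical risk, because h_b labels x as 1 exactly when x < b.
    """
    negatives = [x for x, y in sample if y == 0]
    return min(negatives) if negatives else math.inf


def failure_rate(eps, delta, a_star=0.5, trials=2000, seed=0):
    """Fraction of trials in which the ERM threshold has true error above eps."""
    rng = random.Random(seed)
    m = sample_complexity(eps, delta)
    failures = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(m)]
        sample = [(x, int(x < a_star)) for x in xs]
        b = min(erm_threshold(sample), 1.0)
        # Under the uniform distribution on [0, 1], L_D(h_b) = |b - a_star|.
        if abs(b - a_star) > eps:
            failures += 1
    return failures / trials


print(sample_complexity(0.1, 0.1), failure_rate(0.1, 0.1))
```

With ε = δ = 0.1 the lemma asks for 30 examples, and the observed failure rate should stay below δ. The bound is not tight for this particular ERM choice, so the observed rate is typically well under δ.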


6.2 THE VC-DIMENSION 


We see, therefore, that while finiteness of H is a sufficient condition for learnability,
it is not a necessary condition. As we will show, a property called the VC-dimension 
of a hypothesis class gives the correct characterization of its learnability. To moti- 
vate the definition of the VC-dimension, let us recall the No-Free-Lunch theorem 
(Theorem 5.1) and its proof. There, we have shown that without restricting the 
hypothesis class, for any learning algorithm, an adversary can construct a distribution
for which the learning algorithm will perform poorly, while there is another
learning algorithm that will succeed on the same distribution. To do so, the adversary
used a finite set C ⊂ X and considered a family of distributions that are
concentrated on elements of C. Each distribution was derived from a “true” tar- 
get function from C to {0,1}. To make any algorithm fail, the adversary used the 
power of choosing a target function from the set of all possible functions from C to 
{0, 1}. 

When considering PAC learnability of a hypothesis class H, the adversary is 
restricted to constructing distributions for which some hypothesis h € H achieves a 
zero risk. Since we are considering distributions that are concentrated on elements 
of C, we should study how H behaves on C, which leads to the following definition. 


Definition 6.2 (Restriction of H to C). Let H be a class of functions from X to {0, 1}
and let $C = \{c_1, \ldots, c_m\} \subset \mathcal{X}$. The restriction of H to C is the set of functions from C
to {0, 1} that can be derived from H. That is,

$$\mathcal{H}_C = \{(h(c_1), \ldots, h(c_m)) : h \in \mathcal{H}\},$$

where we represent each function from C to {0,1} as a vector in $\{0,1\}^{|C|}$.


If the restriction of H to C is the set of all functions from C to {0, 1}, then we say 
that H shatters the set C. Formally: 


Definition 6.3 (Shattering). A hypothesis class H shatters a finite set C ⊂ X if the
restriction of H to C is the set of all functions from C to {0,1}. That is, $|\mathcal{H}_C| = 2^{|C|}$.


Example 6.2. Let H be the class of threshold functions over R. Take a set $C = \{c_1\}$.
Now, if we take $a = c_1 + 1$, then we have $h_a(c_1) = 1$, and if we take $a = c_1 - 1$, then
we have $h_a(c_1) = 0$. Therefore, $\mathcal{H}_C$ is the set of all functions from C to {0,1}, and H
shatters C. Now take a set $C = \{c_1, c_2\}$, where $c_1 < c_2$. No h ∈ H can account for the
labeling (0,1), because any threshold that assigns the label 0 to $c_1$ must assign the
label 0 to $c_2$ as well. Therefore not all functions from C to {0,1} are included in $\mathcal{H}_C$;
hence C is not shattered by H.
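
The restriction and the shattering test translate directly into code once the class is represented by finitely many hypotheses. The following is a minimal sketch; `thresholds_for` is a hypothetical helper that picks enough thresholds to realize every labeling of C that any threshold function can produce.

```python
def restriction(hypotheses, C):
    """H_C: the set of label vectors (h(c_1), ..., h(c_m)) realized on C."""
    return {tuple(h(c) for c in C) for h in hypotheses}


def shatters(hypotheses, C):
    """True if every one of the 2^|C| labelings of C is realized."""
    return len(restriction(hypotheses, C)) == 2 ** len(C)


def thresholds_for(C):
    """Thresholds h_a(x) = 1[x < a] with a at each point of C and just above max(C)."""
    cuts = sorted(C) + [max(C) + 1]
    return [lambda x, a=a: int(x < a) for a in cuts]


print(shatters(thresholds_for([3.0]), [3.0]))
print(restriction(thresholds_for([1.0, 2.0]), [1.0, 2.0]))
```

On the two-point set the labeling (0, 1) is missing from the restriction, which matches the argument in Example 6.2.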


Getting back to the construction of an adversarial distribution as in the proof 
of the No-Free-Lunch theorem (Theorem 5.1), we see that whenever some set C is 
shattered by H, the adversary is not restricted by H, as they can construct a distri- 
bution over C based on any target function from C to {0, 1}, while still maintaining 
the realizability assumption. This immediately yields: 


Corollary 6.4. Let H be a hypothesis class of functions from X to {0,1}. Let m be a
training set size. Assume that there exists a set C ⊂ X of size 2m that is shattered by
H. Then, for any learning algorithm, A, there exist a distribution D over X × {0,1}
and a predictor h ∈ H such that $L_{\mathcal{D}}(h) = 0$ but with probability of at least 1/7 over the
choice of $S \sim \mathcal{D}^m$ we have that $L_{\mathcal{D}}(A(S)) \ge 1/8$.


Corollary 6.4 tells us that if H shatters some set C of size 2m then we cannot learn 
H using m examples. Intuitively, if a set C is shattered by H, and we receive a sample 
containing half the instances of C, the labels of these instances give us no informa- 
tion about the labels of the rest of the instances in C — every possible labeling of the 
rest of the instances can be explained by some hypothesis in H. Philosophically,

    If someone can explain every phenomenon, his explanations are worthless.

This leads us directly to the definition of the VC dimension.


Definition 6.5 (VC-dimension). The VC-dimension of a hypothesis class H,
denoted VCdim(H), is the maximal size of a set C ⊂ X that can be shattered by H. If
H can shatter sets of arbitrarily large size we say that H has infinite VC-dimension.


A direct consequence of Corollary 6.4 is therefore: 


Theorem 6.6. Let H be a class of infinite VC-dimension. Then, H is not PAC 
learnable. 


Proof. Since H has an infinite VC-dimension, for any training set size m, there exists
a shattered set of size 2m, and the claim follows by Corollary 6.4. □


We shall see later in this chapter that the converse is also true: A finite VC- 
dimension guarantees learnability. Hence, the VC-dimension characterizes PAC 
learnability. But before delving into more theory, we first show several examples. 


6.3 EXAMPLES 


In this section we calculate the VC-dimension of several hypothesis classes. To show 
that VCdim(H) = d we need to show that


1. There exists a set C of size d that is shattered by H. 
2. Every set C of size d + 1 is not shattered by H. 


6.3.1 Threshold Functions 


Let H be the class of threshold functions over R. Recall Example 6.2, where we have
shown that for an arbitrary set $C = \{c_1\}$, H shatters C; therefore VCdim(H) ≥ 1. We
have also shown that for an arbitrary set $C = \{c_1, c_2\}$ where $c_1 < c_2$, H does not
shatter C. We therefore conclude that VCdim(H) = 1.


6.3.2 Intervals 


Let H be the class of intervals over R, namely, $\mathcal{H} = \{h_{a,b} : a, b \in \mathbb{R}, a < b\}$, where $h_{a,b} :
\mathbb{R} \to \{0,1\}$ is a function such that $h_{a,b}(x) = \mathbb{1}_{[x \in (a,b)]}$. Take the set C = {1,2}. Then, H
shatters C (make sure you understand why) and therefore VCdim(H) ≥ 2. Now take
an arbitrary set $C = \{c_1, c_2, c_3\}$ and assume without loss of generality that $c_1 < c_2 < c_3$.
Then, the labeling (1,0,1) cannot be obtained by an interval and therefore H does
not shatter C. We therefore conclude that VCdim(H) = 2.
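
For a finite domain and a finite list of hypotheses, the two-step recipe above can be run by brute force. The sketch below is restricted to a five-point grid of our own choosing, so it computes the VC-dimension of the classes as seen on that grid, which happens to agree with the values derived in the text. All names are illustrative.

```python
from itertools import combinations


def restriction(hypotheses, C):
    """H_C: the set of label vectors realized on the tuple of points C."""
    return {tuple(h(c) for c in C) for h in hypotheses}


def vc_dimension(hypotheses, domain):
    """Largest d such that some d-subset of the domain is shattered (brute force)."""
    d = 0
    for size in range(1, len(domain) + 1):
        if any(len(restriction(hypotheses, C)) == 2 ** size
               for C in combinations(domain, size)):
            d = size
        else:
            break  # if no set of this size is shattered, no larger set is either
    return d


domain = [1, 2, 3, 4, 5]
cuts = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
thresholds = [lambda x, a=a: int(x < a) for a in cuts]
intervals = [lambda x, a=a, b=b: int(a < x < b) for a in cuts for b in cuts if a < b]

print(vc_dimension(thresholds, domain), vc_dimension(intervals, domain))
```

The early exit relies on the fact that every subset of a shattered set is itself shattered.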


6.3.3 Axis Aligned Rectangles 
Let H be the class of axis aligned rectangles, formally:

$$\mathcal{H} = \{h_{(a_1,a_2,b_1,b_2)} : a_1 \le a_2 \text{ and } b_1 \le b_2\}$$


6.5 PROOF OF THEOREM 6.7


We have already seen that 1 → 2 in Chapter 4. The implications 2 → 3 and 3 → 4
are trivial and so is 2 → 5. The implications 4 → 6 and 5 → 6 follow from the
No-Free-Lunch theorem. The difficult part is to show that 6 → 1. The proof is based on
two main claims:


■ If VCdim(H) = d, then even though H might be infinite, when restricting it to
  a finite set C ⊂ X, its “effective” size, $|\mathcal{H}_C|$, is only $O(|C|^d)$. That is, the size of
  $\mathcal{H}_C$ grows polynomially rather than exponentially with |C|. This claim is often
  referred to as Sauer’s lemma, but it has also been stated and proved independently
  by Shelah and by Perles. The formal statement is given in Section 6.5.1
  later.

■ In Chapter 4 we have shown that finite hypothesis classes enjoy the uniform convergence
  property. In Section 6.5.2 later we generalize this result and show
  that uniform convergence holds whenever the hypothesis class has a “small
  effective size.” By “small effective size” we mean classes for which $|\mathcal{H}_C|$ grows
  polynomially with |C|.


6.5.1 Sauer’s Lemma and the Growth Function 


We defined the notion of shattering, by considering the restriction of H to a finite 
set of instances. The growth function measures the maximal “effective” size of H on 
a set of m examples. Formally: 


Definition 6.9 (Growth Function). Let H be a hypothesis class. Then the growth
function of H, denoted $\tau_{\mathcal{H}} : \mathbb{N} \to \mathbb{N}$, is defined as

$$\tau_{\mathcal{H}}(m) = \max_{C \subset \mathcal{X} : |C| = m} |\mathcal{H}_C|.$$

In words, $\tau_{\mathcal{H}}(m)$ is the number of different functions from a set C of size m to {0, 1}
that can be obtained by restricting H to C.


Obviously, if VCdim(H) = d then for any m ≤ d we have $\tau_{\mathcal{H}}(m) = 2^m$. In such
cases, H induces all possible functions from C to {0,1}. The following beautiful
lemma, proposed independently by Sauer, Shelah, and Perles, shows that when m
becomes larger than the VC-dimension, the growth function increases polynomially
rather than exponentially with m.


Lemma 6.10 (Sauer-Shelah-Perles). Let H be a hypothesis class with VCdim(H) ≤
d < ∞. Then, for all m, $\tau_{\mathcal{H}}(m) \le \sum_{i=0}^{d} \binom{m}{i}$. In particular, if m > d + 1 then
$\tau_{\mathcal{H}}(m) \le (em/d)^d$.
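
For thresholds and intervals the growth function can be counted directly and compared with the bound of Lemma 6.10. The sketch below assumes m distinct points on the line (only their order matters) and places candidate endpoints between and around them. The function names are ours.

```python
from math import comb, e


def sauer_bound(m, d):
    """Sum_{i=0}^{d} C(m, i), the bound of Lemma 6.10."""
    return sum(comb(m, i) for i in range(d + 1))


def growth_thresholds(m):
    """Number of labelings of m distinct points by h_a(x) = 1[x < a]."""
    points = range(1, m + 1)
    cuts = [k + 0.5 for k in range(m + 1)]
    return len({tuple(int(x < a) for x in points) for a in cuts})


def growth_intervals(m):
    """Number of labelings of m distinct points by open intervals (a, b) with a < b."""
    points = range(1, m + 1)
    # The last cut lies beyond every point, so an empty interval is available too.
    cuts = [k + 0.5 for k in range(m + 2)]
    return len({tuple(int(a < x < b) for x in points)
                for a in cuts for b in cuts if a < b})


for m in range(1, 6):
    print(m, growth_thresholds(m), sauer_bound(m, 1), growth_intervals(m), sauer_bound(m, 2))
```

For these two classes the counts coincide with the binomial sum, so the first inequality of the lemma can hold with equality.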


Proof of Sauer’s Lemma* 


To prove the lemma it suffices to prove the following stronger claim: For any $C =
\{c_1, \ldots, c_m\}$ we have

$$\forall \mathcal{H}, \quad |\mathcal{H}_C| \le |\{B \subseteq C : \mathcal{H} \text{ shatters } B\}|. \tag{6.3}$$


The reason why Equation (6.3) is sufficient to prove the lemma is that if
VCdim(H) ≤ d then no set whose size is larger than d is shattered by H and therefore

$$|\{B \subseteq C : \mathcal{H} \text{ shatters } B\}| \le \sum_{i=0}^{d} \binom{m}{i}.$$

When m > d + 1 the right-hand side of the preceding is at most $(em/d)^d$ (see
Lemma A.5 in Appendix A).

We are left with proving Equation (6.3) and we do it using an inductive argument.
For m = 1, no matter what H is, either both sides of Equation (6.3) equal
1 or both sides equal 2 (the empty set is always considered to be shattered by H).
Assume Equation (6.3) holds for sets of size k < m and let us prove it for sets of size
m. Fix H and $C = \{c_1, \ldots, c_m\}$. Denote $C' = \{c_2, \ldots, c_m\}$ and in addition, define the
following two sets:

$$Y_0 = \{(y_2, \ldots, y_m) : (0, y_2, \ldots, y_m) \in \mathcal{H}_C \lor (1, y_2, \ldots, y_m) \in \mathcal{H}_C\},$$

and

$$Y_1 = \{(y_2, \ldots, y_m) : (0, y_2, \ldots, y_m) \in \mathcal{H}_C \land (1, y_2, \ldots, y_m) \in \mathcal{H}_C\}.$$

It is easy to verify that $|\mathcal{H}_C| = |Y_0| + |Y_1|$. Additionally, since $Y_0 = \mathcal{H}_{C'}$, using the
induction assumption (applied on H and C') we have that

$$|Y_0| = |\mathcal{H}_{C'}| \le |\{B \subseteq C' : \mathcal{H} \text{ shatters } B\}| = |\{B \subseteq C : c_1 \notin B \land \mathcal{H} \text{ shatters } B\}|.$$

Next, define $\mathcal{H}' \subseteq \mathcal{H}$ to be

$$\mathcal{H}' = \{h \in \mathcal{H} : \exists h' \in \mathcal{H} \text{ s.t. } (1 - h'(c_1), h'(c_2), \ldots, h'(c_m)) = (h(c_1), h(c_2), \ldots, h(c_m))\},$$

namely, H' contains pairs of hypotheses that agree on C' and differ on $c_1$. Using
this definition, it is clear that if H' shatters a set B ⊆ C' then it also shatters the set
$B \cup \{c_1\}$ and vice versa. Combining this with the fact that $Y_1 = \mathcal{H}'_{C'}$ and using the
inductive assumption (now applied on H' and C') we obtain that

$$\begin{aligned}
|Y_1| = |\mathcal{H}'_{C'}| &\le |\{B \subseteq C' : \mathcal{H}' \text{ shatters } B\}| = |\{B \subseteq C' : \mathcal{H}' \text{ shatters } B \cup \{c_1\}\}| \\
&= |\{B \subseteq C : c_1 \in B \land \mathcal{H}' \text{ shatters } B\}| \le |\{B \subseteq C : c_1 \in B \land \mathcal{H} \text{ shatters } B\}|.
\end{aligned}$$

Overall, we have shown that

$$\begin{aligned}
|\mathcal{H}_C| &= |Y_0| + |Y_1| \\
&\le |\{B \subseteq C : c_1 \notin B \land \mathcal{H} \text{ shatters } B\}| + |\{B \subseteq C : c_1 \in B \land \mathcal{H} \text{ shatters } B\}| \\
&= |\{B \subseteq C : \mathcal{H} \text{ shatters } B\}|,
\end{aligned}$$

which concludes our proof.

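
Equation (6.3) can be sanity-checked numerically by treating a restricted class as a set of binary label vectors. The sketch below draws random nonempty classes on m = 4 points and counts shattered index sets by brute force. It is an illustration with names of our own, not a substitute for the inductive proof.

```python
import random
from itertools import combinations


def shattered_subsets(H_C, m):
    """Count index sets B of {0, ..., m-1} shattered by a class given as label vectors."""
    count = 0
    for size in range(m + 1):
        for B in combinations(range(m), size):
            projections = {tuple(h[i] for i in B) for h in H_C}
            if len(projections) == 2 ** size:
                count += 1  # the empty set is counted for every nonempty class
    return count


def check_equation_6_3(m=4, trials=200, seed=0):
    """Check |H_C| <= number of shattered subsets on random nonempty classes."""
    rng = random.Random(seed)
    all_vectors = [tuple((v >> i) & 1 for i in range(m)) for v in range(2 ** m)]
    for _ in range(trials):
        size = rng.randint(1, len(all_vectors))
        H_C = set(rng.sample(all_vectors, size))
        if len(H_C) > shattered_subsets(H_C, m):
            return False
    return True


print(check_equation_6_3())
print(shattered_subsets({(0, 0), (1, 0), (1, 1)}, 2))
```

The second call uses the restriction of thresholds to two points: it has three label vectors and shatters exactly three subsets (the empty set and the two singletons), so the inequality is tight there.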
6.5.2 Uniform Convergence for Classes of Small Effective Size 


In this section we prove that if H has small effective size then it enjoys the uniform 
convergence property. Formally,

Theorem 6.11. Let H be a class and let $\tau_{\mathcal{H}}$ be its growth function. Then, for every D
and every δ ∈ (0,1), with probability of at least 1 − δ over the choice of $S \sim \mathcal{D}^m$ we
have, for all h ∈ H,

$$|L_{\mathcal{D}}(h) - L_S(h)| \le \frac{4 + \sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\delta\sqrt{2m}}.$$

Before proving the theorem, let us first conclude the proof of Theorem 6.7.

Proof of Theorem 6.7. It suffices to prove that if the VC-dimension is finite then the
uniform convergence property holds. We will prove that

$$m^{UC}_{\mathcal{H}}(\epsilon, \delta) \le 4\,\frac{16d}{(\delta\epsilon)^2}\log\left(\frac{16d}{(\delta\epsilon)^2}\right) + \frac{16d\log(2e/d)}{(\delta\epsilon)^2}.$$

From Sauer’s lemma we have that for m > d, $\tau_{\mathcal{H}}(2m) \le (2em/d)^d$. Combining this
with Theorem 6.11 we obtain that with probability of at least 1 − δ,

$$|L_S(h) - L_{\mathcal{D}}(h)| \le \frac{4 + \sqrt{d\log(2em/d)}}{\delta\sqrt{2m}}.$$

For simplicity assume that $\sqrt{d\log(2em/d)} \ge 4$; hence,

$$|L_S(h) - L_{\mathcal{D}}(h)| \le \frac{1}{\delta}\sqrt{\frac{2d\log(2em/d)}{m}}.$$

To ensure that the preceding is at most ε we need that

$$m \ge \frac{2d\log(m)}{(\delta\epsilon)^2} + \frac{2d\log(2e/d)}{(\delta\epsilon)^2}.$$

Standard algebraic manipulations (see Lemma A.2 in Appendix A) show that a
sufficient condition for the preceding to hold is that

$$m \ge 4\,\frac{2d}{(\delta\epsilon)^2}\log\left(\frac{2d}{(\delta\epsilon)^2}\right) + \frac{4d\log(2e/d)}{(\delta\epsilon)^2}.$$

□


Remark 6.4. The upper bound on $m^{UC}_{\mathcal{H}}$ we derived in the proof of Theorem 6.7 is not
the tightest possible. A tighter analysis that yields the bounds given in Theorem 6.8
can be found in Chapter 28.
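
The last step of the proof can be checked numerically: plugging the sufficient sample size back into the deviation bound should give a value of at most ε. The sketch below does this for a couple of parameter choices that we picked arbitrarily. The function names are hypothetical, and the check assumes, as the proof does, that $\sqrt{d\log(2em/d)} \ge 4$.

```python
import math


def sufficient_sample_size(d, eps, delta):
    """The sufficient condition on m stated at the end of the proof of Theorem 6.7."""
    a = 2 * d / (delta * eps) ** 2
    return 4 * a * math.log(a) + 4 * d * math.log(2 * math.e / d) / (delta * eps) ** 2


def deviation_bound(m, d, delta):
    """(1/delta) * sqrt(2 d log(2em/d) / m), the simplified bound from the proof."""
    return math.sqrt(2 * d * math.log(2 * math.e * m / d) / m) / delta


for d, eps, delta in [(3, 0.1, 0.1), (10, 0.05, 0.05)]:
    m = math.ceil(sufficient_sample_size(d, eps, delta))
    print(d, eps, delta, m, deviation_bound(m, d, delta))
```

The resulting sample sizes are in the millions even for small d, which is consistent with Remark 6.4: the dependence on 1/δ in this bound is far from tight.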


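Proof of Theorem 6.11*
We will start by showing that

$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|\right] \le \frac{4 + \sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\sqrt{2m}}. \tag{6.4}$$

Since the random variable $\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|$ is nonnegative, the proof of
the theorem follows directly from the preceding using Markov’s inequality (see
Section B.1).

To bound the left-hand side of Equation (6.4) we first note that for every h ∈ H,
we can rewrite $L_{\mathcal{D}}(h) = \mathbb{E}_{S' \sim \mathcal{D}^m}[L_{S'}(h)]$, where $S' = z'_1, \ldots, z'_m$ is an additional i.i.d.
sample. Therefore,

$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|\right] = \mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} \left|\mathbb{E}_{S' \sim \mathcal{D}^m} L_{S'}(h) - L_S(h)\right|\right].$$

A generalization of the triangle inequality yields

$$\left|\mathbb{E}_{S' \sim \mathcal{D}^m}[L_{S'}(h) - L_S(h)]\right| \le \mathbb{E}_{S' \sim \mathcal{D}^m} |L_{S'}(h) - L_S(h)|,$$

and the fact that supremum of expectation is smaller than expectation of supremum
yields

$$\sup_{h \in \mathcal{H}} \mathbb{E}_{S' \sim \mathcal{D}^m} |L_{S'}(h) - L_S(h)| \le \mathbb{E}_{S' \sim \mathcal{D}^m} \sup_{h \in \mathcal{H}} |L_{S'}(h) - L_S(h)|.$$

Formally, the previous two inequalities follow from Jensen’s inequality. Combining
all we obtain

$$\begin{aligned}
\mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|\right] &\le \mathbb{E}_{S, S' \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{S'}(h) - L_S(h)|\right] \\
&= \mathbb{E}_{S, S' \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^{m} \big(\ell(h, z'_i) - \ell(h, z_i)\big)\right|\right].
\end{aligned} \tag{6.5}$$

The expectation on the right-hand side is over a choice of two i.i.d. samples $S =
z_1, \ldots, z_m$ and $S' = z'_1, \ldots, z'_m$. Since all of these 2m vectors are chosen i.i.d., nothing
will change if we replace the name of the random vector $z_i$ with the name of the
random vector $z'_i$. If we do it, instead of the term $(\ell(h, z'_i) - \ell(h, z_i))$ in Equation (6.5)
we will have the term $-(\ell(h, z'_i) - \ell(h, z_i))$. It follows that for every $\sigma \in \{\pm 1\}^m$ we
have that Equation (6.5) equals

$$\mathbb{E}_{S, S' \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^{m} \sigma_i \big(\ell(h, z'_i) - \ell(h, z_i)\big)\right|\right].$$

Since this holds for every $\sigma \in \{\pm 1\}^m$, it also holds if we sample each component of σ
uniformly at random from the uniform distribution over {±1}, denoted $U_{\pm}$. Hence,
Equation (6.5) also equals

$$\mathbb{E}_{\sigma \sim U_{\pm}^m}\, \mathbb{E}_{S, S' \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^{m} \sigma_i \big(\ell(h, z'_i) - \ell(h, z_i)\big)\right|\right],$$

and by the linearity of expectation it also equals

$$\mathbb{E}_{S, S' \sim \mathcal{D}^m}\, \mathbb{E}_{\sigma \sim U_{\pm}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^{m} \sigma_i \big(\ell(h, z'_i) - \ell(h, z_i)\big)\right|\right].$$

Next, fix S and S', and let C be the instances appearing in S and S'. Then, we can
take the supremum only over $h \in \mathcal{H}_C$. Therefore,

$$\mathbb{E}_{\sigma \sim U_{\pm}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^{m} \sigma_i \big(\ell(h, z'_i) - \ell(h, z_i)\big)\right|\right] = \mathbb{E}_{\sigma \sim U_{\pm}^m}\left[\max_{h \in \mathcal{H}_C} \frac{1}{m}\left|\sum_{i=1}^{m} \sigma_i \big(\ell(h, z'_i) - \ell(h, z_i)\big)\right|\right].$$

Fix some $h \in \mathcal{H}_C$ and denote $\theta_h = \frac{1}{m}\sum_{i=1}^{m} \sigma_i (\ell(h, z'_i) - \ell(h, z_i))$. Since $\mathbb{E}[\theta_h] = 0$ and
$\theta_h$ is an average of independent variables, each of which takes values in [−1,1], we
have by Hoeffding’s inequality that for every ρ > 0,

$$\mathbb{P}[|\theta_h| > \rho] \le 2\exp(-2m\rho^2).$$

Applying the union bound over $h \in \mathcal{H}_C$, we obtain that for any ρ > 0,

$$\mathbb{P}\left[\max_{h \in \mathcal{H}_C} |\theta_h| > \rho\right] \le 2|\mathcal{H}_C|\exp(-2m\rho^2).$$

Finally, Lemma A.4 in Appendix A tells us that the preceding implies

$$\mathbb{E}\left[\max_{h \in \mathcal{H}_C} |\theta_h|\right] \le \frac{4 + \sqrt{\log(|\mathcal{H}_C|)}}{\sqrt{2m}}.$$

Combining all with the definition of $\tau_{\mathcal{H}}$, we have shown that

$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|\right] \le \frac{4 + \sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\sqrt{2m}}.$$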
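
Equation (6.4) can be compared with a simulation for the class of thresholds, whose growth function is $\tau_{\mathcal{H}}(2m) = 2m + 1$. The sketch below assumes instances uniform on [0, 1] labeled by the threshold 0.5, and approximates the supremum over thresholds by a finite grid, so the measured value slightly underestimates the true supremum. All names and parameters are our own choices.

```python
import math
import random


def bound_6_4(m, growth_2m):
    """Right-hand side of Equation (6.4): (4 + sqrt(log tau_H(2m))) / sqrt(2m)."""
    return (4 + math.sqrt(math.log(growth_2m))) / math.sqrt(2 * m)


def sup_deviation(xs, a_star=0.5, grid=200):
    """Approximate sup over thresholds a of |L_D(h_a) - L_S(h_a)| on a grid in [0, 1].

    Instances are uniform on [0, 1] and labeled by h_{a_star}, so h_a errs exactly
    on the points lying between a and a_star.
    """
    m = len(xs)
    worst = 0.0
    for k in range(grid + 1):
        a = k / grid
        lo, hi = min(a, a_star), max(a, a_star)
        true_risk = hi - lo
        emp_risk = sum(lo <= x < hi for x in xs) / m
        worst = max(worst, abs(true_risk - emp_risk))
    return worst


def mean_sup_deviation(m=100, trials=50, seed=0):
    """Monte Carlo estimate of the left-hand side of Equation (6.4)."""
    rng = random.Random(seed)
    return sum(sup_deviation([rng.random() for _ in range(m)])
               for _ in range(trials)) / trials


print(mean_sup_deviation(), bound_6_4(100, 201))
```

The estimated expectation should come out well below the bound, which reflects the loose constants in the analysis rather than any special property of thresholds.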


6.6 SUMMARY 


The fundamental theorem of learning theory characterizes PAC learnability of 
classes of binary classifiers using VC-dimension. The VC-dimension of a class is a 
combinatorial property that denotes the maximal sample size that can be shattered 
by the class. The fundamental theorem states that a class is PAC learnable if and 
only if its VC-dimension is finite and specifies the sample complexity required for 
PAC learning. The theorem also shows that if a problem is at all learnable, then 
uniform convergence holds and therefore the problem is learnable using the ERM
rule.


6.7 BIBLIOGRAPHIC REMARKS 


The definition of VC-dimension and its relation to learnability and to uniform con- 
vergence is due to the seminal work of Vapnik and Chervonenkis (1971). The 
relation to the definition of PAC learnability is due to Blumer, Ehrenfeucht, 
Haussler, and Warmuth (1989). 

Several generalizations of the VC-dimension have been proposed. For example, 
the fat-shattering dimension characterizes learnability of some regression problems
(Kearns, Schapire & Sellie 1994; Alon, Ben-David, Cesa-Bianchi & Haussler
1997; Bartlett, Long & Williamson 1994; Anthony & Bartlett 1999), and the
Natarajan dimension characterizes learnability of some multiclass learning problems
(Natarajan 1989). However, in general, there is no equivalence between
learnability and uniform convergence. See (Shalev-Shwartz, Shamir, Srebro &
Sridharan 2010; Daniely, Sabato, Ben-David & Shalev-Shwartz 2011).

Sauer’s lemma was proved by Sauer in response to a problem of Erdős
(Sauer 1972). Shelah (with Perles) proved it as a useful lemma for Shelah’s theory
of stable models (Shelah 1972). Gil Kalai tells¹ us that at some later time, Benjy
Weiss asked Perles about such a result in the context of ergodic theory, and Perles,
who forgot that he had proved it once, proved it again. Vapnik and Chervonenkis
proved the lemma in the context of statistical learning theory.


6.8 EXERCISES 


6.1 Show the following monotonicity property of VC-dimension: For every two hypothesis
classes if $\mathcal{H}' \subseteq \mathcal{H}$ then $\mathrm{VCdim}(\mathcal{H}') \le \mathrm{VCdim}(\mathcal{H})$.

6.2 Given some finite domain set, X, and a number k ≤ |X|, figure out the VC-dimension
of each of the following classes (and prove your claims):
1. $\mathcal{H}^{\mathcal{X}}_{=k} = \{h \in \{0,1\}^{\mathcal{X}} : |\{x : h(x) = 1\}| = k\}$: that is, the set of all functions that assign
the value 1 to exactly k elements of X.
2. $\mathcal{H}_{\text{at-most-}k} = \{h \in \{0,1\}^{\mathcal{X}} : |\{x : h(x) = 1\}| \le k \text{ or } |\{x : h(x) = 0\}| \le k\}$.

6.3 Let X be the Boolean hypercube $\{0,1\}^n$. For a set $I \subseteq \{1, 2, \ldots, n\}$ we define a parity
function $h_I$ as follows. On a binary vector $x = (x_1, x_2, \ldots, x_n) \in \{0,1\}^n$,

$$h_I(x) = \left(\sum_{i \in I} x_i\right) \bmod 2.$$

(That is, $h_I$ computes parity of bits in I.) What is the VC-dimension of the class of
all such parity functions, $\mathcal{H}_{n\text{-parity}} = \{h_I : I \subseteq \{1, 2, \ldots, n\}\}$?

6.4 We proved Sauer’s lemma by proving that for every class H of finite VC-dimension
d, and every subset A of the domain,

$$|\mathcal{H}_A| \le |\{B \subseteq A : \mathcal{H} \text{ shatters } B\}| \le \sum_{i=0}^{d} \binom{|A|}{i}.$$

Show that there are cases in which the previous two inequalities are strict (namely,
the ≤ can be replaced by <) and cases in which they can be replaced by equalities.
Demonstrate all four combinations of = and <.

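6.5 VC-dimension of axis aligned rectangles in $\mathbb{R}^d$: Let $\mathcal{H}^d_{\mathrm{rec}}$ be the class of axis aligned
rectangles in $\mathbb{R}^d$. We have already seen that $\mathrm{VCdim}(\mathcal{H}^2_{\mathrm{rec}}) = 4$. Prove that in general,
$\mathrm{VCdim}(\mathcal{H}^d_{\mathrm{rec}}) = 2d$.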

6.6 VC-dimension of Boolean conjunctions: Let $\mathcal{H}^d_{\mathrm{con}}$ be the class of Boolean conjunctions
over the variables $x_1, \ldots, x_d$ (d ≥ 2). We already know that this class is finite
and thus (agnostic) PAC learnable. In this question we calculate $\mathrm{VCdim}(\mathcal{H}^d_{\mathrm{con}})$.

1. Show that $|\mathcal{H}^d_{\mathrm{con}}| \le 3^d + 1$.
2. Conclude that VCdim(H) ≤ d log 3.
3. Show that $\mathcal{H}^d_{\mathrm{con}}$ shatters the set of unit vectors $\{e_i : i \le d\}$.
4. (**) Show that $\mathrm{VCdim}(\mathcal{H}^d_{\mathrm{con}}) \le d$.
Hint: Assume by contradiction that there exists a set $C = \{c_1, \ldots, c_{d+1}\}$ that is
shattered by $\mathcal{H}^d_{\mathrm{con}}$. Let $h_1, \ldots, h_{d+1}$ be hypotheses in $\mathcal{H}^d_{\mathrm{con}}$ that satisfy

$$\forall i, j \in [d+1], \quad h_i(c_j) = \begin{cases} 0 & i = j \\ 1 & \text{otherwise} \end{cases}$$

For each i ∈ [d + 1], $h_i$ (or more accurately, the conjunction that corresponds to
$h_i$) contains some literal $\ell_i$ which is false on $c_i$ and true on $c_j$ for each j ≠ i. Use
the Pigeonhole principle to show that there must be a pair i < j ≤ d + 1 such
that $\ell_i$ and $\ell_j$ use the same $x_k$ and use that fact to derive a contradiction to the
requirements from the conjunctions $h_i$, $h_j$.
5. Consider the class $\mathcal{H}^d_{\mathrm{mcon}}$ of monotone Boolean conjunctions over $\{0,1\}^d$. Monotonicity
here means that the conjunctions do not contain negations. As in $\mathcal{H}^d_{\mathrm{con}}$,
the empty conjunction is interpreted as the all-positive hypothesis. We augment
$\mathcal{H}^d_{\mathrm{mcon}}$ with the all-negative hypothesis $h^-$. Show that $\mathrm{VCdim}(\mathcal{H}^d_{\mathrm{mcon}}) = d$.

¹ http://gilkalai.wordpress.com/2008/09/28/extremal-combinatorics-iii-some-basic-theorems

6.7 We have shown that for a finite hypothesis class H, $\mathrm{VCdim}(\mathcal{H}) \le \lfloor \log(|\mathcal{H}|) \rfloor$. However,
this is just an upper bound. The VC-dimension of a class can be much lower
than that:

1. Find an example of a class H of functions over the real interval X = [0,1] such
that H is infinite while VCdim(H) = 1.

2. Give an example of a finite hypothesis class H over the domain X = [0,1], where
$\mathrm{VCdim}(\mathcal{H}) = \lfloor \log_2(|\mathcal{H}|) \rfloor$.

6.8 (*) It is often the case that the VC-dimension of a hypothesis class equals (or can
be bounded above by) the number of parameters one needs to set in order to define
each hypothesis in the class. For instance, if H is the class of axis aligned rectangles in
$\mathbb{R}^d$, then VCdim(H) = 2d, which is equal to the number of parameters used to define
a rectangle in $\mathbb{R}^d$. Here is an example that shows that this is not always the case.
We will see that a hypothesis class might be very complex and even not learnable,
although it has a small number of parameters.

Consider the domain X = R, and the hypothesis class

$$\mathcal{H} = \{x \mapsto \lceil \sin(\theta x) \rceil : \theta \in \mathbb{R}\}$$

(here, we take ⌈−1⌉ = 0). Prove that VCdim(H) = ∞.
Hint: There is more than one way to prove the required result. One option is by
applying the following lemma: If $0.x_1 x_2 x_3 \ldots$ is the binary expansion of x ∈ (0,1),
then for any natural number m, $\lceil \sin(2^m \pi x) \rceil = (1 - x_m)$, provided that ∃k ≥ m s.t.
$x_k = 1$.

6.9 Let $H$ be the class of signed intervals, that is,
$H = \{h_{a,b,s} : a \le b, s \in \{-1,1\}\}$ where

\[ h_{a,b,s}(x) = \begin{cases} s & \text{if } x \in [a,b] \\ -s & \text{if } x \notin [a,b] \end{cases} \]

Calculate $\mathrm{VCdim}(H)$.
6.10 Let $H$ be a class of functions from $\mathcal{X}$ to $\{0,1\}$.
1. Prove that if $\mathrm{VCdim}(H) \ge d$, for any $d$, then for some probability distribution $D$
over $\mathcal{X} \times \{0,1\}$, for every sample size, $m$,

\[ \mathbb{E}_{S \sim D^m}[L_D(A(S))] \ge \min_{h \in H} L_D(h) + \frac{d-m}{2d} \]
6.8 Exercises


1. Use the Dudley representation to figure out the VC-dimension of the class
$P^d_1$, the class of all $d$-degree polynomials over $\mathbb{R}$.

2. Prove that the class of all polynomial classifiers over $\mathbb{R}$ has infinite VC-dimension.

3. Use the Dudley representation to figure out the VC-dimension of the class
$P^d_n$ (as a function of $d$ and $n$).
Nonuniform Learnability


The notions of PAC learnability discussed so far in the book allow the sample 
sizes to depend on the accuracy and confidence parameters, but they are uniform 
with respect to the labeling rule and the underlying data distribution. Conse- 
quently, classes that are learnable in that respect are limited (they must have a 
finite VC-dimension, as stated by Theorem 6.7). In this chapter we consider more 
relaxed, weaker notions of learnability. We discuss the usefulness of such notions
and provide a characterization of the concept classes that are learnable using these
definitions.

We begin this discussion by defining a notion of “nonuniform learnability” that 
allows the sample size to depend on the hypothesis to which the learner is com- 
pared. We then provide a characterization of nonuniform learnability and show that 
nonuniform learnability is a strict relaxation of agnostic PAC learnability. We also 
show that a sufficient condition for nonuniform learnability is that H is a count- 
able union of hypothesis classes, each of which enjoys the uniform convergence 
property. These results will be proved in Section 7.2 by introducing a new learning 
paradigm, which is called Structural Risk Minimization (SRM). In Section 7.3 we 
specify the SRM paradigm for countable hypothesis classes, which yields the Min- 
imum Description Length (MDL) paradigm. The MDL paradigm gives a formal 
justification to a philosophical principle of induction called Occam’s razor. Next, 
in Section 7.4 we introduce consistency as an even weaker notion of learnabil- 
ity. Finally, we discuss the significance and usefulness of the different notions of 
learnability. 


7.1 NONUNIFORM LEARNABILITY 


“Nonuniform learnability” allows the sample size to be nonuniform with respect to
the different hypotheses with which the learner is competing. We say that a hypothesis
$h$ is $(\epsilon,\delta)$-competitive with another hypothesis $h'$ if, with probability higher than
$(1 - \delta)$,

\[ L_D(h) \le L_D(h') + \epsilon. \]


In PAC learnability, this notion of “competitiveness” is not very useful, as we 
are looking for a hypothesis with an absolute low risk (in the realizable case) or 
with a low risk compared to the minimal risk achieved by hypotheses in our class 
(in the agnostic case). Therefore, the sample size depends only on the accuracy and 
confidence parameters. In nonuniform learnability, however, we allow the sample
size to be of the form $m_H(\epsilon,\delta,h)$; namely, it depends also on the $h$ with which we
are competing. Formally,


Definition 7.1. A hypothesis class $H$ is nonuniformly learnable if there exist a
learning algorithm, $A$, and a function $m^{NUL}_H : (0,1)^2 \times H \to \mathbb{N}$ such that, for every
$\epsilon,\delta \in (0,1)$ and for every $h \in H$, if $m \ge m^{NUL}_H(\epsilon,\delta,h)$ then for every distribution $D$,
with probability of at least $1 - \delta$ over the choice of $S \sim D^m$, it holds that

\[ L_D(A(S)) \le L_D(h) + \epsilon. \]


At this point it might be useful to recall the definition of agnostic PAC learnability
(Definition 3.3):
A hypothesis class $H$ is agnostically PAC learnable if there exist a learning algorithm,
$A$, and a function $m_H : (0,1)^2 \to \mathbb{N}$ such that, for every $\epsilon,\delta \in (0,1)$ and for every
distribution $D$, if $m \ge m_H(\epsilon,\delta)$, then with probability of at least $1 - \delta$ over the choice of
$S \sim D^m$ it holds that

\[ L_D(A(S)) \le \min_{h' \in H} L_D(h') + \epsilon. \]

Note that this implies that for every $h \in H$

\[ L_D(A(S)) \le L_D(h) + \epsilon. \]


In both types of learnability, we require that the output hypothesis will be
$(\epsilon,\delta)$-competitive with every other hypothesis in the class. The difference between
these two notions of learnability is whether the sample size $m$ may
depend on the hypothesis $h$ to which the error of $A(S)$ is compared. Note that
nonuniform learnability is a relaxation of agnostic PAC learnability. That is, if a
class is agnostic PAC learnable then it is also nonuniformly learnable.


7.1.1 Characterizing Nonuniform Learnability 


Our goal now is to characterize nonuniform learnability. In the previous chapter 
we have found a crisp characterization of PAC learnable classes, by showing that a 
class of binary classifiers is agnostic PAC learnable if and only if its VC-dimension is 
finite. In the following theorem we find a different characterization for nonuniform 
learnable classes for the task of binary classification. 


Theorem 7.2. A hypothesis class $H$ of binary classifiers is nonuniformly learnable if
and only if it is a countable union of agnostic PAC learnable hypothesis classes.


The proof of Theorem 7.2 relies on the following result of independent interest:


Theorem 7.3. Let $H$ be a hypothesis class that can be written as a countable union
of hypothesis classes, $H = \bigcup_{n \in \mathbb{N}} H_n$, where each $H_n$ enjoys the uniform convergence
property. Then, $H$ is nonuniformly learnable.


Recall that in Chapter 4 we have shown that uniform convergence is sufficient for 
agnostic PAC learnability. Theorem 7.3 generalizes this result to nonuniform learn- 
ability. The proof of this theorem will be given in the next section by introducing a 
new learning paradigm. We now turn to proving Theorem 7.2. 


Proof of Theorem 7.2. First assume that $H = \bigcup_{n \in \mathbb{N}} H_n$ where each $H_n$ is agnostic
PAC learnable. Using the fundamental theorem of statistical learning, it follows
that each $H_n$ has the uniform convergence property. Therefore, using Theorem 7.3
we obtain that $H$ is nonuniform learnable.

For the other direction, assume that $H$ is nonuniform learnable using some
algorithm $A$. For every $n \in \mathbb{N}$, let $H_n = \{h \in H : m^{NUL}_H(1/8, 1/7, h) \le n\}$. Clearly,
$H = \bigcup_{n \in \mathbb{N}} H_n$. In addition, using the definition of $m^{NUL}_H$ we know that for any distribution
$D$ that satisfies the realizability assumption with respect to $H_n$, with probability
of at least $6/7$ over $S \sim D^n$ we have that $L_D(A(S)) \le 1/8$. Using the fundamental
theorem of statistical learning, this implies that the VC-dimension of $H_n$ must be
finite, and therefore $H_n$ is agnostic PAC learnable. □


The following example shows that nonuniform learnability is a strict relax- 
ation of agnostic PAC learnability; namely, there are hypothesis classes that are 
nonuniform learnable but are not agnostic PAC learnable. 


Example 7.1. Consider a binary classification problem with the instance domain
being $\mathcal{X} = \mathbb{R}$. For every $n \in \mathbb{N}$ let $H_n$ be the class of polynomial classifiers of degree
$n$; namely, $H_n$ is the set of all classifiers of the form $h(x) = \mathrm{sign}(p(x))$ where
$p : \mathbb{R} \to \mathbb{R}$ is a polynomial of degree $n$. Let $H = \bigcup_{n \in \mathbb{N}} H_n$. Therefore, $H$ is the class
of all polynomial classifiers over $\mathbb{R}$. It is easy to verify that $\mathrm{VCdim}(H) = \infty$ while
$\mathrm{VCdim}(H_n) = n + 1$ (see Exercise 7.12). Hence, $H$ is not PAC learnable, while on
the basis of Theorem 7.3, $H$ is nonuniformly learnable.
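
The lower bound $\mathrm{VCdim}(H_n) \ge n + 1$ can be illustrated by interpolation: any $\pm 1$ labeling of $n + 1$ distinct points is matched exactly by a polynomial of degree at most $n$. The sketch below is only an illustration under that reading (degree at most $n$, with $\mathrm{sign}(0)$ taken as $+1$); the function names are hypothetical, and the points are assumed to be distinct integers so that exact rational arithmetic can be used.

```python
from fractions import Fraction
from itertools import product

def interpolate(xs, ys):
    """Return p(x) for the unique polynomial of degree <= len(xs) - 1 through (xs, ys)."""
    def p(x):
        total = Fraction(0)
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = Fraction(yi)
            for j, xj in enumerate(xs):
                if j != i:
                    term *= Fraction(x - xj, xi - xj)  # Lagrange basis factor
            total += term
        return total
    return p

def sign(v):
    return 1 if v >= 0 else -1

def shattered_by_polynomials(xs):
    """Check that every +/-1 labeling of xs is realized by sign(p), deg(p) <= len(xs) - 1."""
    for labels in product((-1, 1), repeat=len(xs)):
        p = interpolate(xs, labels)
        if any(sign(p(x)) != y for x, y in zip(xs, labels)):
            return False
    return True
```

For instance, `shattered_by_polynomials([0, 1, 2, 3])` checks all 16 labelings of four points against cubic polynomials. The matching upper bound needs a separate argument and is not checked here.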


7.2 STRUCTURAL RISK MINIMIZATION 


So far, we have encoded our prior knowledge by specifying a hypothesis class H, 
which we believe includes a good predictor for the learning task at hand. Yet 
another way to express our prior knowledge is by specifying preferences over 
hypotheses within $H$. In the Structural Risk Minimization (SRM) paradigm, we
do so by first assuming that $H$ can be written as $H = \bigcup_{n \in \mathbb{N}} H_n$ and then specifying
a weight function, $w : \mathbb{N} \to [0,1]$, which assigns a weight to each hypothesis
class, $H_n$, such that a higher weight reflects a stronger preference for the hypothesis
class. In this section we discuss how to learn with such prior knowledge. In the next
section we describe a couple of important weighting schemes, including Minimum
Description Length.

Concretely, let $H$ be a hypothesis class that can be written as $H = \bigcup_{n \in \mathbb{N}} H_n$. For
example, $H$ may be the class of all polynomial classifiers where each $H_n$ is the class
of polynomial classifiers of degree $n$ (see Example 7.1). Assume that for each $n$, the
class $H_n$ enjoys the uniform convergence property (see Definition 4.3 in Chapter 4)
with a sample complexity function $m^{UC}_{H_n}(\epsilon,\delta)$. Let us also define the function
$\epsilon_n : \mathbb{N} \times (0,1) \to (0,1)$ by

\[ \epsilon_n(m,\delta) = \min\{\epsilon \in (0,1) : m^{UC}_{H_n}(\epsilon,\delta) \le m\}. \tag{7.1} \]


In words, we have a fixed sample size m, and we are interested in the lowest possible 
upper bound on the gap between empirical and true risks achievable by using a 
sample of m examples. 
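
As a worked instance, assume that each $H_n$ is finite and that its sample complexity function is taken to be the finite-class uniform convergence bound from Chapter 4 (Hoeffding's inequality plus a union bound, for losses in $[0,1]$). This is an assumption made here for illustration, not part of the definition. Inverting that bound gives an explicit upper bound on $\epsilon_n$, valid whenever the right-hand side lies in $(0,1)$:

```latex
m^{UC}_{H_n}(\epsilon,\delta) \le \left\lceil \frac{\log(2|H_n|/\delta)}{2\epsilon^2} \right\rceil
\quad\Longrightarrow\quad
\epsilon_n(m,\delta) \le \sqrt{\frac{\log(2|H_n|/\delta)}{2m}} .
```

The implication holds because plugging $\epsilon = \sqrt{\log(2|H_n|/\delta)/(2m)}$ into the left-hand bound gives exactly $m$, so this $\epsilon$ is feasible in the minimization of Equation (7.1).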

From the definitions of uniform convergence and $\epsilon_n$, it follows that for every $m$
and $\delta$, with probability of at least $1 - \delta$ over the choice of $S \sim D^m$ we have that

\[ \forall h \in H_n, \quad |L_D(h) - L_S(h)| \le \epsilon_n(m,\delta). \tag{7.2} \]


Let $w : \mathbb{N} \to [0,1]$ be a function such that $\sum_{n=1}^{\infty} w(n) \le 1$. We refer to $w$ as a
weight function over the hypothesis classes $H_1, H_2, \ldots$. Such a weight function can
reflect the importance that the learner attributes to each hypothesis class, or some
measure of the complexity of different hypothesis classes. If $H$ is a finite union of $N$
hypothesis classes, one can simply assign the same weight of $1/N$ to all hypothesis
classes. This equal weighting corresponds to no a priori preference to any hypothesis
class. Of course, if one believes (as prior knowledge) that a certain hypothesis class is
more likely to contain the correct target function, then it should be assigned a larger
weight, reflecting this prior knowledge. When $H$ is a (countable) infinite union of
hypothesis classes, a uniform weighting is not possible but many other weighting
schemes may work. For example, one can choose $w(n) = \frac{6}{\pi^2 n^2}$ or $w(n) = 2^{-n}$. Later
in this chapter we will provide another convenient way to define weighting functions
using description languages.
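
The requirement $\sum_n w(n) \le 1$ can be checked numerically for candidate weighting schemes. The sketch below evaluates partial sums for two schemes, $6/(\pi^2 n^2)$ and $2^{-n}$; the function names are hypothetical, and a partial sum only suggests, rather than proves, the value of the infinite series.

```python
import math

def w_inverse_square(n):
    """w(n) = 6 / (pi^2 n^2); the full series sums to 1."""
    return 6.0 / (math.pi ** 2 * n ** 2)

def w_geometric(n):
    """w(n) = 2^(-n); the full series sums to 1."""
    return 2.0 ** (-n)

def partial_sum(w, upto):
    """Sum of w(1) + ... + w(upto)."""
    return sum(w(n) for n in range(1, upto + 1))
```

The inverse-square scheme decays polynomially, so it penalizes large indices much less severely than the geometric scheme does.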

The SRM rule follows a “bound minimization” approach. This means that the 
goal of the paradigm is to find a hypothesis that minimizes a certain upper bound 
on the true risk. The bound that the SRM rule wishes to minimize is given in the 
following theorem. 


Theorem 7.4. Let $w : \mathbb{N} \to [0,1]$ be a function such that $\sum_{n=1}^{\infty} w(n) \le 1$. Let $H$ be a
hypothesis class that can be written as $H = \bigcup_{n \in \mathbb{N}} H_n$, where for each $n$, $H_n$ satisfies
the uniform convergence property with a sample complexity function $m^{UC}_{H_n}$. Let $\epsilon_n$
be as defined in Equation (7.1). Then, for every $\delta \in (0,1)$ and distribution $D$, with
probability of at least $1 - \delta$ over the choice of $S \sim D^m$, the following bound holds
(simultaneously) for every $n \in \mathbb{N}$ and $h \in H_n$.

\[ |L_D(h) - L_S(h)| \le \epsilon_n(m, w(n) \cdot \delta). \]

Therefore, for every $\delta \in (0,1)$ and distribution $D$, with probability of at least $1 - \delta$ it
holds that

\[ \forall h \in H, \quad L_D(h) \le L_S(h) + \min_{n : h \in H_n} \epsilon_n(m, w(n) \cdot \delta). \tag{7.3} \]


Proof. For each $n$ define $\delta_n = w(n)\delta$. Applying the assumption that uniform convergence
holds for all $n$ with the rate given in Equation (7.2), we obtain that if we fix $n$
in advance, then with probability of at least $1 - \delta_n$ over the choice of $S \sim D^m$,

\[ \forall h \in H_n, \quad |L_D(h) - L_S(h)| \le \epsilon_n(m, \delta_n). \]

Applying the union bound over $n = 1, 2, \ldots$, we obtain that with probability of at
least $1 - \sum_n \delta_n = 1 - \delta \sum_n w(n) \ge 1 - \delta$, the preceding holds for all $n$, which concludes
our proof. □


Denote

\[ n(h) = \min\{n : h \in H_n\}, \tag{7.4} \]

and then Equation (7.3) implies that

\[ L_D(h) \le L_S(h) + \epsilon_{n(h)}(m, w(n(h)) \cdot \delta). \]


The SRM paradigm searches for h that minimizes this bound, as formalized in 
the following pseudocode: 


Structural Risk Minimization (SRM)

prior knowledge:
  $H = \bigcup_n H_n$ where $H_n$ has uniform convergence with $m^{UC}_{H_n}$
  $w : \mathbb{N} \to [0,1]$ where $\sum_n w(n) \le 1$
define: $\epsilon_n$ as in Equation (7.1); $n(h)$ as in Equation (7.4)
input: training set $S \sim D^m$, confidence $\delta$
output: $h \in \operatorname{argmin}_{h \in H} \left[ L_S(h) + \epsilon_{n(h)}(m, w(n(h)) \cdot \delta) \right]$


Unlike the ERM paradigm discussed in previous chapters, we no longer just care
about the empirical risk, $L_S(h)$, but we are willing to trade some of our bias toward
low empirical risk with a bias toward classes for which $\epsilon_{n(h)}(m, w(n(h)) \cdot \delta)$ is smaller,
for the sake of a smaller estimation error.
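
The SRM rule can be sketched in a few lines for the special case in which every $H_n$ is a finite list of hypotheses. This is a minimal sketch under stated assumptions, not a general implementation: the names `srm`, `eps_finite`, and `threshold` are hypothetical, the 0-1 loss is assumed, and $\epsilon_n$ is replaced by the finite-class Hoeffding rate $\sqrt{\log(2|H_n|/\delta)/(2m)}$. The search runs over all pairs $(n, h)$ with $h \in H_n$, which corresponds to minimizing the bound of Equation (7.3).

```python
import math

def eps_finite(class_size, m, delta):
    """Assumed rate for a finite class: sqrt(log(2|H_n| / delta) / (2m))."""
    return math.sqrt(math.log(2 * class_size / delta) / (2 * m))

def srm(classes, weights, sample, delta):
    """classes[n-1] is a list of callables forming H_n; weights[n-1] is w(n).

    Returns (bound, n, h) minimizing L_S(h) + eps_n(m, w(n) * delta).
    """
    m = len(sample)
    best = None
    for n, (hyps, w) in enumerate(zip(classes, weights), start=1):
        penalty = eps_finite(len(hyps), m, w * delta)
        for h in hyps:
            emp_risk = sum(h(x) != y for x, y in sample) / m  # 0-1 loss
            bound = emp_risk + penalty
            if best is None or bound < best[0]:
                best = (bound, n, h)
    return best

def threshold(t):
    """Hypothetical example hypothesis: x -> 1 if x >= t else 0."""
    def h(x):
        return int(x >= t)
    h.t = t
    return h
```

As a usage example, one can take $H_n$ to be the thresholds on a grid of resolution $2^{-n}$, `classes = [[threshold(k / 2 ** n) for k in range(2 ** n + 1)] for n in range(1, 5)]`, with weights $2^{-n}$. Several classes then contain hypotheses with zero empirical risk on data labeled by the threshold $0.5$, and the rule returns the one from the class with the smallest penalty.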

Next we show that the SRM paradigm can be used for nonuniform learning of
every class that is a countable union of uniformly converging hypothesis classes.


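Theorem 7.5. Let $H$ be a hypothesis class such that $H = \bigcup_{n \in \mathbb{N}} H_n$, where each $H_n$ has
the uniform convergence property with sample complexity $m^{UC}_{H_n}$. Let $w : \mathbb{N} \to [0,1]$ be
such that $w(n) = \frac{6}{n^2 \pi^2}$. Then, $H$ is nonuniformly learnable using the SRM rule with
rate

\[ m^{NUL}_H(\epsilon,\delta,h) \le m^{UC}_{H_{n(h)}}\left(\epsilon/2, \frac{6\delta}{(\pi n(h))^2}\right). \]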


Proof. Let $A$ be the SRM algorithm with respect to the weighting function $w$. For
every $h \in H$, $\epsilon$, and $\delta$, let $m \ge m^{UC}_{H_{n(h)}}(\epsilon, w(n(h))\delta)$. Using the fact that $\sum_n w(n) = 1$,
we can apply Theorem 7.4 to get that, with probability of at least $1 - \delta$ over the
choice of $S \sim D^m$, we have that for every $h' \in H$,

\[ L_D(h') \le L_S(h') + \epsilon_{n(h')}(m, w(n(h'))\delta). \]

The preceding holds in particular for the hypothesis $A(S)$ returned by the SRM rule.
By the definition of SRM we obtain that

\[ L_D(A(S)) \le \min_{h'} \left[ L_S(h') + \epsilon_{n(h')}(m, w(n(h'))\delta) \right] \le L_S(h) + \epsilon_{n(h)}(m, w(n(h))\delta). \]

Finally, if $m \ge m^{UC}_{H_{n(h)}}(\epsilon/2, w(n(h))\delta)$ then clearly $\epsilon_{n(h)}(m, w(n(h))\delta) \le \epsilon/2$. In addition,
from the uniform convergence property of each $H_n$ we have that with
probability of more than $1 - \delta$,

\[ L_S(h) \le L_D(h) + \epsilon/2. \]

Combining all the preceding we obtain that $L_D(A(S)) \le L_D(h) + \epsilon$, which concludes
our proof. □


Note that the previous theorem also proves Theorem 7.3. 
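
To see how the rate of Theorem 7.5 grows with the index $n(h)$, the sketch below evaluates the bound $m^{UC}_{H_{n(h)}}(\epsilon/2, 6\delta/(\pi n(h))^2)$ under the assumption that each $H_n$ is finite and that $m^{UC}_{H_n}(\epsilon,\delta)$ is taken to be the finite-class bound $\lceil \log(2|H_n|/\delta)/(2\epsilon^2) \rceil$. Both function names and the `class_size_of` parameter are hypothetical.

```python
import math

def m_uc_finite(class_size, eps, delta):
    """Assumed uniform-convergence sample complexity of a finite class."""
    return math.ceil(math.log(2 * class_size / delta) / (2 * eps ** 2))

def m_nul_bound(class_size_of, n_h, eps, delta):
    """Bound of Theorem 7.5 for a hypothesis h with n(h) = n_h.

    class_size_of(n) returns |H_n|; the weight is w(n) = 6 / (pi n)^2.
    """
    delta_n = 6 * delta / (math.pi * n_h) ** 2
    return m_uc_finite(class_size_of(n_h), eps / 2, delta_n)
```

With `class_size_of = lambda n: 2 ** n + 1`, the required sample size increases with `n_h`: competing with a hypothesis that first appears in a later class costs more examples, which is exactly the nonuniformity that the definition allows.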
are more likely to be the correct one, and in the learning algorithm we prefer
hypotheses that have higher weights. 

In this section we discuss a particularly convenient way to define a weight function
over H, which is derived from the length of descriptions given to hypotheses. Hav- 
ing a hypothesis class, one can wonder about how we describe, or represent, each 
hypothesis in the class. We naturally fix some description language. This can be 
English, or a programming language, or some set of mathematical formulas. In any 
of these languages, a description consists of finite strings of symbols (or characters) 
drawn from some fixed alphabet. We shall now formalize these notions. 

Let $H$ be the hypothesis class we wish to describe. Fix some finite set $\Sigma$ of symbols
(or “characters”), which we call the alphabet. For concreteness, we let $\Sigma = \{0,1\}$.
A string is a finite sequence of symbols from $\Sigma$; for example, $\sigma = (0,1,1,1,0)$
is a string of length 5. We denote by $|\sigma|$ the length of a string. The set of all finite
length strings is denoted $\Sigma^*$. A description language for $H$ is a function $d : H \to \Sigma^*$,
mapping each member $h$ of $H$ to a string $d(h)$. $d(h)$ is called “the description of $h$,”
and its length is denoted by $|h|$.

We shall require that description languages be prefix-free; namely, for every dis- 
tinct h,h’, d(h) is not a prefix of d(h’). That is, we do not allow that any string d(h) 
is exactly the first |h| symbols of any longer string d(h’). Prefix-free collections of 
strings enjoy the following combinatorial property: 


Lemma 7.6 (Kraft Inequality). If $S \subseteq \{0,1\}^*$ is a prefix-free set of strings, then

\[ \sum_{\sigma \in S} \frac{1}{2^{|\sigma|}} \le 1. \]

Proof. Define a probability distribution over the members of $S$ as follows: Repeatedly
toss an unbiased coin, with faces labeled 0 and 1, until the sequence of outcomes
is a member of $S$; at that point, stop. For each $\sigma \in S$, let $P(\sigma)$ be the probability that
this process generates the string $\sigma$. Note that since $S$ is prefix-free, for every $\sigma \in S$, if
the coin toss outcomes follow the bits of $\sigma$ then we will stop only once the sequence
of outcomes equals $\sigma$. We therefore get that, for every $\sigma \in S$, $P(\sigma) = \frac{1}{2^{|\sigma|}}$. Since
probabilities add up to at most 1, our proof is concluded.
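
Kraft's inequality is easy to check on small examples. The following sketch (hypothetical helper names, strings given as Python `str` objects over the characters `0` and `1`) tests whether a finite set is prefix-free and computes the sum $\sum_{\sigma} 2^{-|\sigma|}$ exactly with rational arithmetic.

```python
from fractions import Fraction

def is_prefix_free(strings):
    """True if no string in the set is a prefix of a different string in the set."""
    ordered = sorted(set(strings))
    # After lexicographic sorting, a string that is a prefix of any other
    # string is also a prefix of its immediate successor.
    return all(not b.startswith(a) for a, b in zip(ordered, ordered[1:]))

def kraft_sum(strings):
    """Exact value of the sum over the set of 2^(-length)."""
    return sum(Fraction(1, 2 ** len(s)) for s in set(strings))
```

For the prefix-free set `{"0", "10", "110", "111"}` the sum equals exactly 1, so the inequality is tight there. For a set such as `{"0", "1", "00"}`, which is not prefix-free, the sum exceeds 1.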


In light of Kraft’s inequality, any prefix-free description language of a hypothesis
class, $H$, gives rise to a weighting function $w$ over that hypothesis class: we will
simply set $w(h) = \frac{1}{2^{|h|}}$. This observation immediately yields the following:


Theorem 7.7. Let $H$ be a hypothesis class and let $d : H \to \{0,1\}^*$ be a prefix-free
description language for $H$. Then, for every sample size, $m$, every confidence parameter,
$\delta > 0$, and every probability distribution, $D$, with probability greater than $1 - \delta$
over the choice of $S \sim D^m$ we have that,

\[ \forall h \in H, \quad L_D(h) \le L_S(h) + \sqrt{\frac{|h| + \ln(2/\delta)}{2m}} \]

where $|h|$ is the length of $d(h)$.


Proof. Choose $w(h) = 1/2^{|h|}$, apply Theorem 7.4 with $\epsilon_n(m,\delta) = \sqrt{\frac{\ln(2/\delta)}{2m}}$, and note
that $\ln(2^{|h|}) = |h| \ln(2) < |h|$. □
7.3 Minimum Description Length and Occam’s Razor


As was the case with Theorem 7.4, this result suggests a learning paradigm for
$H$: given a training set, $S$, search for a hypothesis $h \in H$ that minimizes the bound,
$L_S(h) + \sqrt{\frac{|h| + \ln(2/\delta)}{2m}}$. In particular, it suggests trading off empirical risk for saving
description length. This yields the Minimum Description Length learning paradigm.


Minimum Description Length (MDL)

prior knowledge:
  $H$ is a countable hypothesis class
  $H$ is described by a prefix-free language over $\{0,1\}$
  For every $h \in H$, $|h|$ is the length of the representation of $h$
input: A training set $S \sim D^m$, confidence $\delta$
output: $h \in \operatorname{argmin}_{h \in H} \left[ L_S(h) + \sqrt{\frac{|h| + \ln(2/\delta)}{2m}} \right]$
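
For a finite list of candidate hypotheses the MDL rule can be sketched directly. The code below is an illustrative sketch only: the name `mdl` is hypothetical, the 0-1 loss is assumed, each hypothesis is supplied together with its description as a bit string, and prefix-freeness of the descriptions is assumed to have been ensured by the caller.

```python
import math

def mdl(hypotheses, sample, delta):
    """hypotheses: list of (description, h) pairs, description being a bit string.

    Returns the pair minimizing L_S(h) + sqrt((|h| + ln(2/delta)) / (2m)).
    """
    m = len(sample)

    def bound(item):
        description, h = item
        emp_risk = sum(h(x) != y for x, y in sample) / m  # 0-1 loss
        penalty = math.sqrt((len(description) + math.log(2 / delta)) / (2 * m))
        return emp_risk + penalty

    return min(hypotheses, key=bound)
```

When two hypotheses fit the sample equally well, the one with the shorter description has the smaller bound and is returned; a very short description does not help a hypothesis whose empirical risk is large.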


Example 7.3. Let $H$ be the class of all predictors that can be implemented using
some programming language, say, C++. Let us represent each program using the 
binary string obtained by running the gzip command on the program (this yields 
a prefix-free description language over the alphabet {0,1}). Then, |h| is simply 
the length (in bits) of the output of gzip when running on the C++ program 
corresponding to h. 
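
A description length in the spirit of this example can be computed with Python's standard `gzip` module. This is only a rough sketch: the function name is hypothetical, the program is assumed to be available as a source string, and the result depends on the compressor implementation and compression level, so it should be read as one possible choice of $|h|$ rather than a canonical one.

```python
import gzip

def description_length_bits(program_source: str) -> int:
    """Length in bits of the gzip-compressed program text, used as |h|."""
    # mtime=0 fixes the timestamp field in the gzip header so the output is reproducible.
    compressed = gzip.compress(program_source.encode("utf-8"), mtime=0)
    return 8 * len(compressed)
```

A longer program with little internal repetition typically receives a longer description, and hence a smaller weight $2^{-|h|}$, than a short one.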


7.3.1 Occam’s Razor 


Theorem 7.7 suggests that, having two hypotheses sharing the same empirical risk, 
the true risk of the one that has shorter description can be bounded by a lower value. 
Thus, this result can be viewed as conveying a philosophical message: 


A short explanation (that is, a hypothesis that has a short length) tends to be more 
valid than a long explanation. 


This is a well known principle, called Occam’s razor, after William of Ockham, a 
14th-century English logician, who is believed to have been the first to phrase it 
explicitly. Here, we provide one possible justification to this principle. The inequal- 
ity of Theorem 7.7 shows that the more complex a hypothesis h is (in the sense of 
having a longer description), the larger the sample size it has to fit to guarantee that 
it has a small true risk, $L_D(h)$.
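
The dependence of the required sample size on the description length can be made explicit by solving $\sqrt{(|h| + \ln(2/\delta))/(2m)} \le \epsilon$ for $m$, which gives $m \ge (|h| + \ln(2/\delta))/(2\epsilon^2)$. The sketch below (hypothetical function names) evaluates both the penalty term of Theorem 7.7 and this sample size.

```python
import math

def mdl_penalty(desc_len, m, delta):
    """Penalty term of Theorem 7.7 for a hypothesis with |h| = desc_len."""
    return math.sqrt((desc_len + math.log(2 / delta)) / (2 * m))

def sample_size_for(desc_len, eps, delta):
    """An integer m for which the penalty term is at most eps."""
    return math.ceil((desc_len + math.log(2 / delta)) / (2 * eps ** 2))
```

The required sample size grows linearly in $|h|$: for fixed $\epsilon$ and $\delta$, a hypothesis whose description is ten times longer needs roughly ten times as many examples for the same guarantee.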

At a second glance, our Occam razor claim might seem somewhat problematic. 
In the context in which the Occam razor principle is usually invoked in science, the 
language according to which complexity is measured is a natural language, whereas 
here we may consider any arbitrary abstract description language. Assume that we 
have two hypotheses such that |h’| is much smaller than |h|. By the preceding result, 
if both have the same error on a given training set, S, then the true error of h may 
be much higher than the true error of h’, so one should prefer h’ over h. However, 
we could have chosen a different description language, say, one that assigns a string 
of length 3 to h and a string of length 100000 to h’. Suddenly it looks as if one should
prefer h over h’. But these are the same h and h’ for which we argued two sentences
ago that h’ should be preferable. Where is the catch here? 

Indeed, there is no inherent generalizability difference between hypotheses. 
The crucial aspect here is the dependency order between the initial choice of lan- 
guage (or, preference over hypotheses) and the training set. As we know from 
the basic Hoeffding’s bound (Equation (4.2)), if we commit to any hypothesis
before seeing the data, then we are guaranteed a rather small estimation error term
$L_D(h) \le L_S(h) + \sqrt{\frac{\ln(2/\delta)}{2m}}$. Choosing a description language (or, equivalently, some
weighting of hypotheses) is a weak form of committing to a hypothesis. Rather than
committing to a single hypothesis, we spread out our commitment among many. As
long as it is done independently of the training sample, our generalization bound
holds. Just as the choice of a single hypothesis to be evaluated by a sample can be
arbitrary, so is the choice of description language.


7.4 OTHER NOTIONS OF LEARNABILITY - CONSISTENCY


The notion of learnability can be further relaxed by allowing the needed sample
sizes to depend not only on $\epsilon$, $\delta$, and $h$ but also on the underlying data-generating
probability distribution $D$ (that is used to generate the training sample and to determine
the risk). This type of performance guarantee is captured by the notion of
consistency¹ of a learning rule.


Definition 7.8 (Consistency). Let $Z$ be a domain set, let $P$ be a set of probability
distributions over $Z$, and let $H$ be a hypothesis class. A learning rule $A$ is consistent
with respect to $H$ and $P$ if there exists a function $m^{CON}_H : (0,1)^2 \times H \times P \to \mathbb{N}$ such
that, for every $\epsilon,\delta \in (0,1)$, every $h \in H$, and every $D \in P$, if $m \ge m^{CON}_H(\epsilon,\delta,h,D)$ then
with probability of at least $1 - \delta$ over the choice of $S \sim D^m$ it holds that

\[ L_D(A(S)) \le L_D(h) + \epsilon. \]

If $P$ is the set of all distributions,² we say that $A$ is universally consistent with respect
to $H$.


The notion of consistency is, of course, a relaxation of our previous notion of 
nonuniform learnability. Clearly if an algorithm nonuniformly learns a class H it is 
also universally consistent for that class. The relaxation is strict in the sense that 
there are consistent learning rules that are not successful nonuniform learners. For 
example, the algorithm Memorize defined in Example 7.4 later is universally consis- 
tent for the class of all binary classifiers over N. However, as we have argued before, 
this class is not nonuniformly learnable. 


Example 7.4. Consider the classification prediction algorithm Memorize defined as 
follows. The algorithm memorizes the training examples, and, given a test point x, it 


1 In the literature, consistency is often defined using the notion of either convergence in
probability (corresponding to weak consistency) or almost sure convergence (corresponding to strong
consistency).

2 Formally, we assume that $Z$ is endowed with some sigma algebra of subsets $\Omega$, and by “all distributions”
we mean all probability distributions that have $\Omega$ contained in their associated family of measurable
subsets.
of the error that stems from estimation error and therefore know how much of the
error is attributed to approximation error. If the approximation error is large, we 
know that we should use a different hypothesis class. Similarly, if a nonuniform 
algorithm fails, we can consider a different weighting function over (subsets of) 
hypotheses. However, when a consistent algorithm fails, we have no idea whether 
this is because of the estimation error or the approximation error. Furthermore, 
even if we are sure we have a problem with the estimation error term, we do not 
know how many more examples are needed to make the estimation error small. 


How to Learn? How to Express Prior Knowledge? 

Maybe the most useful aspect of the theory of learning is in providing an answer to 
the question of “how to learn.” The definition of PAC learning yields the limitation 
of learning (via the No-Free-Lunch theorem) and the necessity of prior knowledge. 
It gives us a crisp way to encode prior knowledge by choosing a hypothesis class, 
and once this choice is made, we have a generic learning rule - ERM. The definition 
of nonuniform learnability also yields a crisp way to encode prior knowledge by 
specifying weights over (subsets of) hypotheses of $H$. Once this choice is made, we
again have a generic learning rule - SRM. The SRM rule is also advantageous in 
model selection tasks, where prior knowledge is partial. We elaborate on model 
selection in Chapter 11 and here we give a brief example. 

Consider the problem of fitting a one dimensional polynomial to data; namely, 
our goal is to learn a function, $h : \mathbb{R} \to \mathbb{R}$, and as prior knowledge we consider the
hypothesis class of polynomials. However, we might be uncertain regarding which 
degree d would give the best results for our data set: A small degree might not fit 
the data well (i.e., it will have a large approximation error), whereas a high degree 
might lead to overfitting (i.e., it will have a large estimation error). In the follow- 
ing we depict the result of fitting a polynomial of degrees 2, 3, and 10 to the same 
training set. 


[Figure: three panels showing the fitted polynomials of Degree 2, Degree 3, and Degree 10 on the same training set]


It is easy to see that the empirical risk decreases as we enlarge the degree. There- 
fore, if we choose H to be the class of all polynomials up to degree 10 then the 
ERM rule with respect to this class would output a 10 degree polynomial and would 
overfit. On the other hand, if we choose too small a hypothesis class, say, polynomials
up to degree 2, then the ERM would suffer from underfitting (i.e., a large
approximation error). In contrast, we can use the SRM rule on the set of all polyno- 
mials, while ordering subsets of H according to their degree, and this will yield a 3rd 
degree polynomial since the combination of its empirical risk and the bound on its 
estimation error is the smallest. In other words, the SRM rule enables us to select 
the right model on the basis of the data itself. The price we pay for this flexibility 
(besides a slight increase of the estimation error relative to PAC learning w.r.t. the 
optimal degree) is that we do not know in advance how many examples are needed
to compete with the best hypothesis in H. 

Unlike the notions of PAC learnability and nonuniform learnability, the defini- 
tion of consistency does not yield a natural learning paradigm or a way to encode 
prior knowledge. In fact, in many cases there is no need for prior knowledge at all. 
For example, we saw that even the Memorize algorithm, which intuitively should not 
be called a learning algorithm, is a consistent algorithm for any class defined over 
a countable domain and a finite label set. This hints that consistency is a very weak 
requirement. 
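
For concreteness, here is a minimal sketch of a Memorize-style rule. It assumes the usual reading of the algorithm: predict the majority label among the training occurrences of the test point, and a fixed default label for points never seen in training. The function name, the `default_label` parameter, and the tie-breaking behavior are assumptions of this sketch.

```python
from collections import Counter, defaultdict

def memorize(sample, default_label=0):
    """Return a predictor that outputs the majority training label of x,
    or default_label if x did not appear in the training sample."""
    votes = defaultdict(Counter)
    for x, y in sample:
        votes[x][y] += 1

    def predict(x):
        if x in votes:  # membership test does not create a new entry
            return votes[x].most_common(1)[0][0]
        return default_label

    return predict
```

The rule uses no hypothesis class and no weighting, which is why it encodes no prior knowledge. Its guarantee comes only from the fact that, on a countable domain, a large enough sample eventually covers most of the probability mass.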


Which Learning Algorithm Should We Prefer? 

One may argue that even though consistency is a weak requirement, it is desirable
that a learning algorithm will be consistent with respect to the set of all functions
from $\mathcal{X}$ to $\mathcal{Y}$, which gives us a guarantee that for enough training examples, we will
always be as good as the Bayes optimal predictor. Therefore, if we have two algorithms,
where one is consistent and the other one is not consistent, we should prefer
the consistent algorithm. However, this argument is problematic for two reasons.
First, it may be that for most “natural” distributions we observe in
practice, the sample complexity of the consistent algorithm is so large
that in every practical situation we will not obtain enough examples to enjoy this
guarantee. Second, it is not very hard to make any PAC or nonuniform learner consistent
with respect to the class of all functions from $\mathcal{X}$ to $\mathcal{Y}$. Concretely, consider
a countable domain, $\mathcal{X}$, a finite label set $\mathcal{Y}$, and a hypothesis class, $H$, of functions
from $\mathcal{X}$ to $\mathcal{Y}$. We can make any nonuniform learner for $H$ consistent with respect
to the class of all classifiers from $\mathcal{X}$ to $\mathcal{Y}$ using the following simple trick: Upon
receiving a training set, we will first run the nonuniform learner over the training
set, and then we will obtain a bound on the true risk of the learned predictor. If this
bound is small enough we are done. Otherwise, we revert to the Memorize algorithm.
This simple modification makes the algorithm consistent with respect to all functions
from $\mathcal{X}$ to $\mathcal{Y}$. Since it is easy to make any algorithm consistent, it may not be wise to
prefer one algorithm over the other just because of consistency considerations.
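
The trick described above can be sketched as a small wrapper. The interface is an assumption of this sketch: the nonuniform learner is taken to return both a predictor and a bound on its true risk, the fallback is any rule such as Memorize, and what counts as "small enough" is supplied as a hypothetical function `threshold` of the sample size, since the text does not fix it.

```python
def make_consistent(nonuniform_learner, fallback_learner, threshold):
    """Combine a learner that reports a risk bound with a fallback rule.

    nonuniform_learner(sample) -> (predictor, risk_bound)
    fallback_learner(sample)   -> predictor (for example, Memorize)
    threshold(m)               -> largest acceptable bound for sample size m
    """
    def learner(sample):
        predictor, risk_bound = nonuniform_learner(sample)
        if risk_bound <= threshold(len(sample)):
            return predictor  # the bound is small enough, keep the learned predictor
        return fallback_learner(sample)  # otherwise revert to the fallback rule
    return learner
```

The wrapper changes the output only when the learned predictor's guarantee is weak, so it keeps the benefits of the prior knowledge encoded in $H$ whenever that prior knowledge turns out to be adequate.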


7.5.1 The No-Free-Lunch Theorem Revisited 


Recall that the No-Free-Lunch theorem (Theorem 5.1 from Chapter 5) implies that 
no algorithm can learn the class of all classifiers over an infinite domain. In contrast, 
in this chapter we saw that the Memorize algorithm is consistent with respect to the 
class of all classifiers over a countable infinite domain. To understand why these two 
statements do not contradict each other, let us first recall the formal statement of 
the No-Free-Lunch theorem. 

Let $\mathcal{X}$ be a countable infinite domain and let $\mathcal{Y} = \{\pm 1\}$. The No-Free-Lunch
theorem implies the following: For any algorithm, $A$, and a training set size, $m$,
there exist a distribution over $\mathcal{X}$ and a function $h^* : \mathcal{X} \to \mathcal{Y}$, such that if $A$ will get
a sample of $m$ i.i.d. training examples, labeled by $h^*$, then $A$ is likely to return a
classifier with a large error.

The consistency of Memorize implies the following: For every distribution over
$\mathcal{X}$ and a labeling function $h^* : \mathcal{X} \to \mathcal{Y}$, there exists a training set size $m$ (that depends
on the distribution and on $h^*$) such that if Memorize receives at least $m$ examples it
is likely to return a classifier with a small error.

We see that in the No-Free-Lunch theorem, we first fix the training set size, and 
then find a distribution and a labeling function that are bad for this training set size. 
In contrast, in consistency guarantees, we first fix the distribution and the labeling 
function, and only then do we find a training set size that suffices for learning this 
particular distribution and labeling function. 
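
The difference in the order of quantifiers can be written side by side. In the sketch below, the constants $1/8$ and $1/7$ are those of Theorem 5.1 as recalled here (an assumption, since that theorem is not restated in this chapter), and $A$ in the second line stands for Memorize.

```latex
\begin{align*}
\text{No-Free-Lunch:}\quad
  & \forall A\ \ \forall m\ \ \exists (D, h^*)\ :\
    \Pr_{S \sim D^m}\big[ L_D(A(S)) \ge 1/8 \big] \ge 1/7, \\
\text{Consistency:}\quad
  & \forall (D, h^*)\ \ \forall \epsilon,\delta \in (0,1)\ \ \exists m_0\ \ \forall m \ge m_0\ :\
    \Pr_{S \sim D^m}\big[ L_D(A(S)) \le \epsilon \big] \ge 1 - \delta .
\end{align*}
```

In the first statement the bad pair $(D, h^*)$ may depend on $m$; in the second, $m_0$ may depend on $(D, h^*)$. Neither statement contradicts the other.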


7.6 SUMMARY 


We introduced nonuniform learnability as a relaxation of PAC learnability and con- 
sistency as a relaxation of nonuniform learnability. This means that even classes of 
infinite VC-dimension can be learnable, in some weaker sense of learnability. We 
discussed the usefulness of the different definitions of learnability. 

For hypothesis classes that are countable, we can apply the Minimum Descrip- 
tion Length scheme, where hypotheses with shorter descriptions are preferred, 
following the principle of Occam’s razor. An interesting example is the hypoth- 
esis class of all predictors we can implement in C++ (or any other programming 
language), which we can learn (nonuniformly) using the MDL scheme. 

Arguably, the class of all predictors we can implement in C++ is a powerful class 
of functions and probably contains all that we can hope to learn in practice. The abil- 
ity to learn this class is impressive, and, seemingly, this chapter should have been the 
last chapter of this book. This is not the case, because of the computational aspect 
of learning: that is, the runtime needed to apply the learning rule. For example, to 
implement the MDL paradigm with respect to all C++ programs, we need to per- 
form an exhaustive search over all C++ programs, which will take forever. Even the 
implementation of the ERM paradigm with respect to all C++ programs of description
length at most 1000 bits requires an exhaustive search over $2^{1000}$ hypotheses.
While the sample complexity of learning this class is just $\frac{1000 + \log(2/\delta)}{\epsilon^2}$, the runtime is
$\ge 2^{1000}$. This is a huge number, much larger than the number of atoms in the visible
universe. In the next chapter we formally define the computational complexity of 
learning. In the second part of this book we will study hypothesis classes for which 
the ERM or SRM schemes can be implemented efficiently. 
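
The gap between the two quantities can be checked with a few lines of arithmetic. The sketch below assumes a sample complexity bound of the form $(b + \log(2/\delta))/\epsilon^2$ for hypotheses described by at most $b$ bits; the values of $\epsilon$ and $\delta$, the function name, and the figure of $10^{80}$ atoms are illustrative assumptions, not taken from the text.

```python
import math


def finite_class_sample_bound(description_bits: int, epsilon: float, delta: float) -> int:
    """Evaluate (description_bits + ln(2/delta)) / epsilon^2, rounded up."""
    return math.ceil((description_bits + math.log(2.0 / delta)) / epsilon ** 2)


bits = 1000
samples_needed = finite_class_sample_bound(bits, epsilon=0.1, delta=0.01)
hypotheses_to_search = 2 ** bits        # one candidate per bit string of length 1000
atoms_estimate = 10 ** 80               # assumed rough estimate, for comparison only

print(samples_needed)                          # on the order of 10^5 examples
print(len(str(hypotheses_to_search)))          # number of decimal digits of 2^1000
print(hypotheses_to_search > atoms_estimate)
```

The number of examples stays modest, while the number of candidate programs has hundreds of decimal digits.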


7.7 BIBLIOGRAPHIC REMARKS 


Our definition of nonuniform learnability is related to the definition of an Occam- 
algorithm in Blumer, Ehrenfeucht, Haussler and Warmuth (1987). The concept of 
SRM is due to (Vapnik & Chervonenkis 1974, Vapnik 1995). The concept of MDL 
is due to (Rissanen 1978, Rissanen 1983). The relation between SRM and MDL 
is discussed in Vapnik (1995). These notions are also closely related to the notion 
of regularization (e.g., Tikhonov 1943). We will elaborate on regularization in the 
second part of this book. 

The notion of consistency of estimators dates back to Fisher (1922). Our pre- 
sentation of consistency follows Steinwart and Christmann (2008), who also derived 
several no-free-lunch theorems. 


7.8 EXERCISES


7.1 Prove that for any finite class $\mathcal{H}$, and any description language $d : \mathcal{H} \to \{0,1\}^*$, the
VC-dimension of $\mathcal{H}$ is at most $2\sup\{|d(h)| : h \in \mathcal{H}\}$, where the supremum is the maximum
description length of a predictor in $\mathcal{H}$. Furthermore, if $d$ is a prefix-free description then
$\mathrm{VCdim}(\mathcal{H}) \le \sup\{|d(h)| : h \in \mathcal{H}\}$.

7.2 Let $\mathcal{H} = \{h_n : n \in \mathbb{N}\}$ be an infinite countable hypothesis class for binary classification.
Show that it is impossible to assign weights to the hypotheses in $\mathcal{H}$ such that

■ $\mathcal{H}$ could be learned nonuniformly using these weights. That is, the weighting
function $w : \mathcal{H} \to [0,1]$ should satisfy the condition $\sum_{h \in \mathcal{H}} w(h) \le 1$.

■ The weights would be monotonically nondecreasing. That is, if $i < j$, then
$w(h_i) \le w(h_j)$.

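7.3 ■ Consider a hypothesis class $\mathcal{H} = \bigcup_{n \in \mathbb{N}} \mathcal{H}_n$, where for every $n \in \mathbb{N}$, $\mathcal{H}_n$ is finite.
Find a weighting function $w : \mathcal{H} \to [0,1]$ such that $\sum_{h \in \mathcal{H}} w(h) \le 1$ and so that for
all $h \in \mathcal{H}$, $w(h)$ is determined by $n(h) = \min\{n : h \in \mathcal{H}_n\}$ and by $|\mathcal{H}_{n(h)}|$.

■ (*) Define such a function $w$ when for all $n$, $\mathcal{H}_n$ is countable (possibly infinite).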

7.4 Let $\mathcal{H}$ be some hypothesis class. For any $h \in \mathcal{H}$, let $|h|$ denote the description length
of $h$, according to some fixed description language. Consider the MDL learning
paradigm in which the algorithm returns:


$$h_S \in \operatorname*{argmin}_{h \in \mathcal{H}} \left[ L_S(h) + \sqrt{\frac{|h| + \ln(2/\delta)}{2m}} \right],$$


where $S$ is a sample of size $m$. For any $B > 0$, let $\mathcal{H}_B = \{h \in \mathcal{H} : |h| \le B\}$, and define


$$h_B^* = \operatorname*{argmin}_{h \in \mathcal{H}_B} L_{\mathcal{D}}(h).$$


Prove a bound on $L_{\mathcal{D}}(h_S) - L_{\mathcal{D}}(h_B^*)$ in terms of $B$, the confidence parameter $\delta$, and
the size of the training set $m$.
Note: Such bounds are known as oracle inequalities in the literature: We wish to
estimate how good we are compared to a reference classifier (or “oracle”) $h_B^*$.

7.5 In this question we wish to show a No-Free-Lunch result for nonuniform learnability:
namely, that, over any infinite domain, the class of all functions is not learnable
even under the relaxed nonuniform variation of learning.

Recall that an algorithm, $A$, nonuniformly learns a hypothesis class $\mathcal{H}$ if there
exists a function $m_{\mathcal{H}}^{NUL} : (0,1)^2 \times \mathcal{H} \to \mathbb{N}$ such that, for every $\epsilon, \delta \in (0,1)$ and for every
$h \in \mathcal{H}$, if $m \ge m_{\mathcal{H}}^{NUL}(\epsilon, \delta, h)$ then for every distribution $\mathcal{D}$, with probability of at least
$1 - \delta$ over the choice of $S \sim \mathcal{D}^m$, it holds that


$$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon.$$


If such an algorithm exists then we say that $\mathcal{H}$ is nonuniformly learnable.

1. Let $A$ be a nonuniform learner for a class $\mathcal{H}$. For each $n \in \mathbb{N}$ define
$\mathcal{H}_n^A = \{h \in \mathcal{H} : m^{NUL}(0.1, 0.1, h) \le n\}$. Prove that each such class $\mathcal{H}_n^A$ has a finite VC-dimension.

2. Prove that if a class $\mathcal{H}$ is nonuniformly learnable then there are classes $\mathcal{H}_n$ so that
$\mathcal{H} = \bigcup_{n \in \mathbb{N}} \mathcal{H}_n$ and, for every $n \in \mathbb{N}$, $\mathrm{VCdim}(\mathcal{H}_n)$ is finite.

3. Let $\mathcal{H}$ be a class that shatters an infinite set. Then, for every sequence of
classes $(\mathcal{H}_n : n \in \mathbb{N})$ such that $\mathcal{H} = \bigcup_{n \in \mathbb{N}} \mathcal{H}_n$, there exists some $n$ for which
$\mathrm{VCdim}(\mathcal{H}_n) = \infty$.

Hint: Given a class $\mathcal{H}$ that shatters some infinite set $K$, and a sequence of classes
$(\mathcal{H}_n : n \in \mathbb{N})$, each having a finite VC-dimension, start by defining subsets $K_n \subseteq K$
such that, for all $n$, $|K_n| > \mathrm{VCdim}(\mathcal{H}_n)$ and for any $n \ne m$, $K_n \cap K_m = \emptyset$. Now, pick
for each such $K_n$ a function $f_n : K_n \to \{0,1\}$ so that no $h \in \mathcal{H}_n$ agrees with $f_n$ on
the domain $K_n$. Finally, define $f : \mathcal{X} \to \{0,1\}$ by combining these $f_n$'s and prove
that $f \in \left(\mathcal{H} \setminus \bigcup_{n \in \mathbb{N}} \mathcal{H}_n\right)$.

4. Construct a class $\mathcal{H}_1$ of functions from the unit interval $[0,1]$ to $\{0,1\}$ that is
nonuniformly learnable but not PAC learnable.

5. Construct a class $\mathcal{H}_2$ of functions from the unit interval $[0,1]$ to $\{0,1\}$ that is not
nonuniformly learnable.


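7.6 In this question we wish to show that the algorithm Memorize is a consistent learner
for every class of (binary-valued) functions over any countable domain. Let $\mathcal{X}$ be a
countable domain and let $\mathcal{D}$ be a probability distribution over $\mathcal{X}$.

1. Let $\{x_i : i \in \mathbb{N}\}$ be an enumeration of the elements of $\mathcal{X}$ so that for all $i \le j$,
$\mathcal{D}(\{x_i\}) \le \mathcal{D}(\{x_j\})$. Prove that

$$\lim_{n \to \infty} \sum_{i \ge n} \mathcal{D}(\{x_i\}) = 0.$$

2. Given any $\epsilon > 0$ prove that there exists $\epsilon_{\mathcal{D}} > 0$ such that

$$\mathcal{D}(\{x \in \mathcal{X} : \mathcal{D}(\{x\}) < \epsilon_{\mathcal{D}}\}) < \epsilon.$$

3. Prove that for every $\eta > 0$, if $n$ is such that $\mathcal{D}(\{x_i\}) < \eta$ for all $i > n$, then for every
$m \in \mathbb{N}$,

$$\mathbb{P}_{S \sim \mathcal{D}^m}\left[\exists x_i : (\mathcal{D}(\{x_i\}) > \eta \text{ and } x_i \notin S)\right] \le n e^{-\eta m}.$$

4. Conclude that if $\mathcal{X}$ is countable then for every probability distribution $\mathcal{D}$ over
$\mathcal{X}$ there exists a function $m_{\mathcal{D}} : (0,1) \times (0,1) \to \mathbb{N}$ such that for every $\epsilon, \delta > 0$ if
$m > m_{\mathcal{D}}(\epsilon, \delta)$ then

$$\mathbb{P}_{S \sim \mathcal{D}^m}\left[\mathcal{D}(\{x : x \notin S\}) > \epsilon\right] < \delta.$$

5. Prove that Memorize is a consistent learner for every class of (binary-valued)
functions over any countable domain.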


8 The Runtime of Learning


So far in the book we have studied the statistical perspective of learning, namely,
how many samples are needed for learning. In other words, we focused on the 
amount of information learning requires. However, when considering automated 
learning, computational resources also play a major role in determining the com- 
plexity of a task: that is, how much computation is involved in carrying out a learning 
task. Once a sufficient training sample is available to the learner, there is some com- 
putation to be done to extract a hypothesis or figure out the label of a given test 
instance. These computational resources are crucial in any practical application of 
machine learning. We refer to these two types of resources as the sample complex- 
ity and the computational complexity. In this chapter, we turn our attention to the 
computational complexity of learning. 

The computational complexity of learning should be viewed in the wider context 
of the computational complexity of general algorithmic tasks. This area has been 
extensively investigated; see, for example, (Sipser 2006). The introductory com- 
ments that follow summarize the basic ideas of that general theory that are most 
relevant to our discussion. 

The actual runtime (in seconds) of an algorithm depends on the specific machine 
the algorithm is being implemented on (e.g., what the clock rate of the machine’s 
CPU is). To avoid dependence on the specific machine, it is common to analyze 
the runtime of algorithms in an asymptotic sense. For example, we say that the 
computational complexity of the merge-sort algorithm, which sorts a list of n items, 
is O(nlog(n)). This implies that we can implement the algorithm on any machine 
that satisfies the requirements of some accepted abstract model of computation, 
and the actual runtime in seconds will satisfy the following: there exist constants $c$
and $n_0$, which can depend on the actual machine, such that, for any value of $n > n_0$,
the runtime in seconds of sorting any $n$ items will be at most $c\,n\log(n)$. It is common
to use the term feasible or efficiently computable for tasks that can be performed
by an algorithm whose running time is $O(p(n))$ for some polynomial function $p$.
One should note that this type of analysis depends on defining what is the input 
size n of any instance to which the algorithm is expected to be applied. For “purely 
algorithmic” tasks, as discussed in the common computational complexity literature,
this input size is clearly defined; the algorithm gets an input instance, say, a list to
be sorted, or an arithmetic operation to be calculated, which has a well defined 
size (say, the number of bits in its representation). For machine learning tasks, the 
notion of an input size is not so clear. An algorithm aims to detect some pattern in 
a data set and can only access random samples of that data. 

We start the chapter by discussing this issue and define the computational 
complexity of learning. For advanced students, we also provide a detailed formal 
definition. We then move on to consider the computational complexity of imple- 
menting the ERM rule. We first give several examples of hypothesis classes where 
the ERM rule can be efficiently implemented, and then consider some cases where, 
although the class is indeed efficiently learnable, ERM implementation is com- 
putationally hard. It follows that hardness of implementing ERM does not imply 
hardness of learning. Finally, we briefly discuss how one can show hardness of a 
given learning task, namely, that no learning algorithm can solve it efficiently. 




8.1 COMPUTATIONAL COMPLEXITY OF LEARNING 




Recall that a learning algorithm has access to a domain of examples, $Z$, a hypothesis
class, $\mathcal{H}$, a loss function, $\ell$, and a training set of examples from $Z$ that are sampled
i.i.d. according to an unknown distribution $\mathcal{D}$. Given parameters $\epsilon$, $\delta$, the algorithm
should output a hypothesis $h$ such that with probability of at least $1 - \delta$,


$$L_{\mathcal{D}}(h) \le \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon.$$


As mentioned before, the actual runtime of an algorithm in seconds depends 
on the specific machine. To allow machine independent analysis, we use the stan- 
dard approach in computational complexity theory. First, we rely on a notion of 
an abstract machine, such as a Turing machine (or a Turing machine over the reals 
[Blum, Shub & Smale 1989]). Second, we analyze the runtime in an asymptotic 
sense, while ignoring constant factors; thus the specific machine is not important as 
long as it implements the abstract machine. Usually, the asymptote is with respect 
to the size of the input to the algorithm. For example, for the merge-sort algorithm 
mentioned before, we analyze the runtime as a function of the number of items that 
need to be sorted. 

In the context of learning algorithms, there is no clear notion of “input size.” One 
might define the input size to be the size of the training set the algorithm receives, 
but that would be rather pointless. If we give the algorithm a very large number 
of examples, much larger than the sample complexity of the learning problem, the 
algorithm can simply ignore the extra examples. Therefore, a larger training set 
does not make the learning problem more difficult, and, consequently, the runtime 
available for a learning algorithm should not increase as we increase the size of the 
training set. Just the same, we can still analyze the runtime as a function of natural 
parameters of the problem such as the target accuracy, the confidence of achiev- 
ing that accuracy, the dimensionality of the domain set, or some measures of the 
complexity of the hypothesis class with which the algorithm’s output is compared. 

To illustrate this, consider a learning algorithm for the task of learning axis 
aligned rectangles. A specific problem of learning axis aligned rectangles is derived
by specifying $\epsilon$, $\delta$, and the dimension of the instance space. We can define a
sequence of problems of the type “rectangles learning” by fixing $\epsilon$, $\delta$ and varying the
dimension to be $d = 2, 3, 4, \ldots$. We can also define another sequence of “rectangles
learning” problems by fixing $d$, $\delta$ and varying the target accuracy to be $\epsilon = \frac{1}{2}, \frac{1}{3}, \ldots$.
One can of course choose other sequences of such problems. Once a sequence of the 
problems is fixed, one can analyze the asymptotic runtime as a function of variables 
of that sequence. 

Before we introduce the formal definition, there is one more subtlety we need to 
tackle. On the basis of the preceding, a learning algorithm can “cheat,” by transfer- 
ring the computational burden to the output hypothesis. For example, the algorithm 
can simply define the output hypothesis to be the function that stores the training set 
in its memory, and whenever it gets a test example x it calculates the ERM hypoth- 
esis on the training set and applies it on x. Note that in this case, our algorithm has a 
fixed output (namely, the function that we have just described) and can run in con- 
stant time. However, learning is still hard — the hardness is now in implementing the 
output classifier to obtain a label prediction. To prevent this “cheating,” we shall 
require that the output of a learning algorithm must be applied to predict the label 
of a new example in time that does not exceed the runtime of training (that is, com- 
puting the output classifier from the input training sample). In the next subsection 
the advanced reader may find a formal definition of the computational complexity 
of learning. 
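
The "cheating" strategy can be made concrete with a small sketch. Everything in it (the counter `evaluations`, the threshold hypotheses, and the function names) is a hypothetical illustration rather than a construction from the text: training returns immediately, and the whole ERM search is repeated inside the returned predictor each time it is called.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]            # (instance, label)
Hypothesis = Callable[[float], int]

evaluations = 0                        # counts hypothesis evaluations, for illustration only


def empirical_risk(h: Hypothesis, sample: Sequence[Example]) -> float:
    global evaluations
    evaluations += len(sample)
    return sum(h(x) != y for x, y in sample) / len(sample)


def erm(hypotheses: List[Hypothesis], sample: Sequence[Example]) -> Hypothesis:
    """Exhaustive ERM over a finite list of hypotheses."""
    return min(hypotheses, key=lambda h: empirical_risk(h, sample))


def cheating_learner(hypotheses: List[Hypothesis], sample: Sequence[Example]) -> Hypothesis:
    """'Training' does no work: it only stores the sample in a closure."""
    def predict(x: float) -> int:
        return erm(hypotheses, sample)(x)      # the full search runs at prediction time
    return predict


def make_threshold(t: float) -> Hypothesis:
    return lambda x: 1 if x >= t else 0


thresholds = [make_threshold(k / 10) for k in range(11)]
sample = [(0.1, 0), (0.35, 0), (0.6, 1), (0.9, 1)]

h = cheating_learner(thresholds, sample)
cost_of_training = evaluations          # nothing has been evaluated yet
label = h(0.7)
cost_of_predicting = evaluations        # all the work happened inside h
print(cost_of_training, label, cost_of_predicting)
```

Requiring that prediction be no more expensive than training rules out learners of this kind.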


8.1.1 Formal Definition* 


The definition that follows relies on a notion of an underlying abstract machine, 
which is usually either a Turing machine or a Turing machine over the reals. We 
will measure the computational complexity of an algorithm using the number of 
“operations” it needs to perform, where we assume that for any machine that imple- 
ments the underlying abstract machine there exists a constant c such that any such 
“operation” can be performed on the machine using c seconds. 


Definition 8.1 (The Computational Complexity of a Learning Algorithm). We
define the complexity of learning in two steps. First we consider the computational
complexity of a fixed learning problem (determined by a triplet $(Z, \mathcal{H}, \ell)$: a domain
set, a benchmark hypothesis class, and a loss function). Then, in the second step we
consider the rate of change of that complexity along a sequence of such tasks.


1. Given a function $f : (0,1)^2 \to \mathbb{N}$, a learning task $(Z, \mathcal{H}, \ell)$, and a learning
algorithm, $A$, we say that $A$ solves the learning task in time $O(f)$ if there
exists some constant number $c$, such that for every probability distribution $\mathcal{D}$
over $Z$, and input $\epsilon, \delta \in (0,1)$, when $A$ has access to samples generated i.i.d.
by $\mathcal{D}$,

■ $A$ terminates after performing at most $c f(\epsilon, \delta)$ operations
■ The output of $A$, denoted $h_A$, can be applied to predict the label of a new
example while performing at most $c f(\epsilon, \delta)$ operations
■ The output of $A$ is probably approximately correct; namely, with probability
of at least $1 - \delta$ (over the random samples $A$ receives),
$L_{\mathcal{D}}(h_A) \le \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon$

2. Consider a sequence of learning problems, $(Z_n, \mathcal{H}_n, \ell_n)$, where problem
$n$ is defined by a domain $Z_n$, a hypothesis class $\mathcal{H}_n$, and a loss function
$\ell_n$. Let $A$ be a learning algorithm designed for solving learning problems
of this form. Given a function $g : \mathbb{N} \times (0,1)^2 \to \mathbb{N}$, we say that the runtime
of $A$ with respect to the preceding sequence is $O(g)$, if for all $n$, $A$ solves
the problem $(Z_n, \mathcal{H}_n, \ell_n)$ in time $O(f_n)$, where $f_n : (0,1)^2 \to \mathbb{N}$ is defined by


$$f_n(\epsilon, \delta) = g(n, \epsilon, \delta).$$


We say that $A$ is an efficient algorithm with respect to a sequence $(Z_n, \mathcal{H}_n, \ell_n)$ if
its runtime is $O(p(n, 1/\epsilon, 1/\delta))$ for some polynomial $p$.


From this definition we see that the question whether a general learning problem
can be solved efficiently depends on how it can be broken into a sequence
of specific learning problems. For example, consider the problem of learning a
finite hypothesis class. As we showed in previous chapters, the ERM rule over
$\mathcal{H}$ is guaranteed to $(\epsilon, \delta)$-learn $\mathcal{H}$ if the number of training examples is of order
$m_{\mathcal{H}}(\epsilon, \delta) = \log(|\mathcal{H}|/\delta)/\epsilon^2$. Assuming that the evaluation of a hypothesis on an
example takes a constant time, it is possible to implement the ERM rule in time
$O(|\mathcal{H}|\, m_{\mathcal{H}}(\epsilon, \delta))$ by performing an exhaustive search over $\mathcal{H}$ with a training set of
size $m_{\mathcal{H}}(\epsilon, \delta)$. For any fixed finite $\mathcal{H}$, the exhaustive search algorithm runs in
polynomial time. Furthermore, if we define a sequence of problems in which $|\mathcal{H}_n| = n$,
then the exhaustive search is still considered to be efficient. However, if we define a
sequence of problems for which $|\mathcal{H}_n| = 2^n$, then the sample complexity is still
polynomial in $n$ but the computational complexity of the exhaustive search algorithm
grows exponentially with $n$ (and the algorithm is thus rendered inefficient).
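
A short numerical sketch of the two sequences may help. It assumes the bound $m_{\mathcal{H}}(\epsilon, \delta) = \log(|\mathcal{H}|/\delta)/\epsilon^2$ quoted above and counts one unit of work per hypothesis evaluation; the function names and the particular values of $\epsilon$ and $\delta$ are illustrative choices.

```python
import math


def sample_size(class_size: int, epsilon: float, delta: float) -> int:
    """m_H(eps, delta) = log(|H| / delta) / eps^2, rounded up."""
    return math.ceil((math.log(class_size) - math.log(delta)) / epsilon ** 2)


def exhaustive_search_cost(class_size: int, epsilon: float, delta: float) -> int:
    """Number of hypothesis evaluations performed: |H| * m_H(eps, delta)."""
    return class_size * sample_size(class_size, epsilon, delta)


epsilon, delta = 0.1, 0.05
for n in (10, 20, 40):
    polynomial_family = exhaustive_search_cost(n, epsilon, delta)        # |H_n| = n
    exponential_family = exhaustive_search_cost(2 ** n, epsilon, delta)  # |H_n| = 2^n
    print(n, sample_size(2 ** n, epsilon, delta), polynomial_family, exponential_family)
```

Doubling $n$ roughly doubles the sample size in the second family, while the search cost is multiplied by about $2^n$.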


8.2 IMPLEMENTING THE ERM RULE 


Given a hypothesis class $\mathcal{H}$, the ERM rule is perhaps the most natural learning
paradigm. Furthermore, for binary classification problems we saw that if learning 
is at all possible, it is possible with the ERM rule. In this section we discuss the 
computational complexity of implementing the ERM rule for several hypothesis 
classes. 

Given a hypothesis class, $\mathcal{H}$, a domain set $Z$, and a loss function $\ell$, the
corresponding $\mathrm{ERM}_{\mathcal{H}}$ rule can be defined as follows:


On a finite input sample $S \in Z^m$ output some $h \in \mathcal{H}$ that minimizes the empirical
loss, $L_S(h) = \frac{1}{|S|} \sum_{z \in S} \ell(h, z)$.


This section studies the runtime of implementing the ERM rule for several 
examples of learning tasks. 
8.2.1 Finite Classes


Limiting the hypothesis class to be a finite class may be considered as a reasonably
mild restriction. For example, $\mathcal{H}$ can be the set of all predictors that can be implemented
by a C++ program written in at most 10000 bits of code. Other examples
of useful finite classes are any hypothesis class that can be parameterized by a finite
number of parameters, where we are satisfied with a representation of each of the
parameters using a finite number of bits, for example, the class of axis aligned rectangles
in the Euclidean space, $\mathbb{R}^d$, when the parameters defining any given rectangle
are specified up to some limited precision.

As we have shown in previous chapters, the sample complexity of learning a
finite class is upper bounded by $m_{\mathcal{H}}(\epsilon, \delta) = c \log(c|\mathcal{H}|/\delta)/\epsilon^c$, where $c = 1$ in the
realizable case and $c = 2$ in the nonrealizable case. Therefore, the sample complexity
has a mild dependence on the size of $\mathcal{H}$. In the example of C++ programs mentioned
before, the number of hypotheses is $2^{10,000}$ but the sample complexity is only
$c(10,000 + \log(c/\delta))/\epsilon^c$.

A straightforward approach for implementing the ERM rule over a finite
hypothesis class is to perform an exhaustive search. That is, for each $h \in \mathcal{H}$ we calculate
the empirical risk, $L_S(h)$, and return a hypothesis that minimizes the empirical
risk. Assuming that the evaluation of $\ell(h, z)$ on a single example takes a constant
amount of time, $k$, the runtime of this exhaustive search becomes $k|\mathcal{H}|m$, where
$m$ is the size of the training set. If we let $m$ be the upper bound on the sample
complexity mentioned, then the runtime becomes $k|\mathcal{H}|\,c \log(c|\mathcal{H}|/\delta)/\epsilon^c$.
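
The exhaustive search can be sketched in a few lines. The example class below, thresholds on $[0,1)$ stored with a fixed number of bits, and all identifiers are assumptions made for illustration; the outer loop runs $|\mathcal{H}|$ times and each iteration evaluates the loss on all $m$ examples.

```python
from typing import Callable, Iterable, Optional, Sequence, Tuple, TypeVar

H = TypeVar("H")    # type of a hypothesis
Z = TypeVar("Z")    # type of an example


def erm_exhaustive(hypotheses: Iterable[H], sample: Sequence[Z],
                   loss: Callable[[H, Z], float]) -> Tuple[Optional[H], float]:
    """Return a hypothesis with minimal empirical risk, and that risk."""
    best_h, best_risk = None, float("inf")
    for h in hypotheses:                                       # |H| iterations
        risk = sum(loss(h, z) for z in sample) / len(sample)   # m loss evaluations
        if risk < best_risk:
            best_h, best_risk = h, risk
    return best_h, best_risk


def zero_one_loss(threshold: float, z: Tuple[float, int]) -> float:
    """0-1 loss of the predictor x -> 1[x >= threshold] on the example z = (x, y)."""
    x, y = z
    return float((1 if x >= threshold else 0) != y)


bits = 4                                                  # precision of each threshold
grid = [k / 2 ** bits for k in range(2 ** bits)]          # 2^bits hypotheses
sample = [(0.10, 0), (0.30, 0), (0.55, 1), (0.80, 1), (0.20, 1)]   # not realizable

best_threshold, best_risk = erm_exhaustive(grid, sample, zero_one_loss)
print(best_threshold, best_risk)
```

Each extra bit of precision doubles the size of `grid`, and with it the running time of the search.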

The linear dependence of the runtime on the size of $\mathcal{H}$ makes this approach
inefficient (and unrealistic) for large classes. Formally, if we define a sequence
of problems $(Z_n, \mathcal{H}_n, \ell_n)_{n=1}^{\infty}$ such that $\log(|\mathcal{H}_n|) = n$, then the exhaustive search
approach yields an exponential runtime. In the example of C++ programs, if $\mathcal{H}_n$
is the set of functions that can be implemented by a C++ program written in at
most $n$ bits of code, then the runtime grows exponentially with $n$, implying that the
exhaustive search approach is unrealistic for practical use. In fact, this problem is
one of the reasons we are dealing with other hypothesis classes, like classes of linear
predictors, which we will encounter in the next chapter, and not just focusing on
finite classes.

It is important to realize that the inefficiency of one algorithmic approach (such 
as the exhaustive search) does not yet imply that no efficient ERM implementation 
exists. Indeed, we will show examples in which the ERM rule can be implemented 
efficiently. 


8.2.2 Axis Aligned Rectangles 
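Let $\mathcal{H}_n$ be the class of axis aligned rectangles in $\mathbb{R}^n$, namely,

$$\mathcal{H}_n = \{h_{(a_1,\ldots,a_n,b_1,\ldots,b_n)} : \forall i,\ a_i \le b_i\}$$

where

$$h_{(a_1,\ldots,a_n,b_1,\ldots,b_n)}(\mathbf{x}) = \begin{cases} 1 & \text{if } \forall i,\ x_i \in [a_i, b_i] \\ 0 & \text{otherwise} \end{cases} \tag{8.1}$$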
Efficiently Learnable in the Realizable Case

Consider implementing the ERM rule in the realizable case. That is, we are given
a training set $S = (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)$ of examples, such that there exists an axis
aligned rectangle, $h \in \mathcal{H}_n$, for which $h(\mathbf{x}_i) = y_i$ for all $i$. Our goal is to find such an axis
aligned rectangle with a zero training error, namely, a rectangle that is consistent
with all the labels in $S$.

This can be done in time $O(nm)$. Indeed, for each $i \in [n]$,
set $a_i = \min\{x_i : (\mathbf{x}, 1) \in S\}$ and $b_i = \max\{x_i : (\mathbf{x}, 1) \in S\}$. In words, we take $a_i$ to be
the minimal value of the $i$'th coordinate of a positive example in $S$ and $b_i$ to be the
maximal value of the $i$'th coordinate of a positive example in $S$. The resulting rectangle
contains every positive example by construction, and it is contained in any rectangle
that is consistent with $S$, so it contains no negative example. It therefore has zero
training error. The runtime of finding
each $a_i$ and $b_i$ is $O(m)$. Hence, the total runtime of this procedure is $O(nm)$.
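
A minimal sketch of this procedure follows. The representation of a rectangle as a pair of coordinate lists, the function names, and the convention of returning `None` when the sample has no positive example are assumptions made here for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]
Rectangle = Tuple[List[float], List[float]]     # (a, b) with a[i] <= b[i]


def erm_rectangle_realizable(sample: Sequence[Tuple[Point, int]]) -> Optional[Rectangle]:
    """Tightest axis aligned rectangle around the positive examples."""
    positives = [x for x, y in sample if y == 1]
    if not positives:
        return None     # assumption: None stands for a rectangle containing no sample point
    n = len(positives[0])
    a = [min(x[i] for x in positives) for i in range(n)]   # O(m) work per coordinate
    b = [max(x[i] for x in positives) for i in range(n)]   # O(m) work per coordinate
    return a, b


def predict(rect: Optional[Rectangle], x: Point) -> int:
    if rect is None:
        return 0
    a, b = rect
    return int(all(a[i] <= x[i] <= b[i] for i in range(len(a))))


sample = [((1, 1), 1), ((2, 3), 1), ((3, 2), 1), ((0, 0), 0), ((4, 4), 0), ((2, 5), 0)]
rect = erm_rectangle_realizable(sample)
training_errors = sum(predict(rect, x) != y for x, y in sample)
print(rect, training_errors)
```

Each of the $2n$ boundaries is found in one pass over the sample, which gives the $O(nm)$ total.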


Not Efficiently Learnable in the Agnostic Case 

In the agnostic case, we do not assume that some hypothesis $h$ perfectly predicts
the labels of all the examples in the training set. Our goal is therefore to find $h$
that minimizes the number of examples for which $y_i \ne h(\mathbf{x}_i)$. It turns out that for
many common hypothesis classes, including the classes of axis aligned rectangles we
consider here, solving the ERM problem in the agnostic setting is NP-hard (and,
in most cases, it is even NP-hard to find some $h \in \mathcal{H}$ whose error is no more than
some constant $c > 1$ times that of the empirical risk minimizer in $\mathcal{H}$). That is, unless
P = NP, there is no algorithm whose running time is polynomial in $m$ and $n$ that
is guaranteed to find an ERM hypothesis for these problems (Ben-David, Eiron &
Long 2003).

On the other hand, it is worth noting that, if we fix one specific hypothesis
class, say, axis aligned rectangles in some fixed dimension, $n$, then there exist efficient
learning algorithms for this class. In other words, there are successful agnostic
PAC learners that run in time polynomial in $1/\epsilon$ and $1/\delta$ (but their dependence on
the dimension $n$ is not polynomial).

To see this, recall the implementation of the ERM rule we presented for the 
realizable case, from which it follows that an axis aligned rectangle is determined by 
at most 2n examples. Therefore, given a training set of size m, we can perform an 
exhaustive search over all subsets of the training set of size at most 2n examples and 
construct a rectangle from each such subset. Then, we can pick the rectangle with 
the minimal training error. This procedure is guaranteed to find an ERM hypothesis,
and the runtime of the procedure is $m^{O(n)}$. It follows that if $n$ is fixed, the runtime
is polynomial in the sample size. This does not contradict the aforementioned hardness
result, since there we argued that unless P = NP one cannot have an algorithm
whose dependence on the dimension $n$ is polynomial as well.
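
The subset search can be sketched as follows. This is an illustrative implementation under stated assumptions: each candidate is the bounding box of a subset of at most $2n$ sample points, and `None` is used here to represent a rectangle that contains no sample point (the all-negative labeling of the sample). The function names are hypothetical.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]
Rectangle = Tuple[List[float], List[float]]


def bounding_box(points: Sequence[Point]) -> Rectangle:
    n = len(points[0])
    return ([min(p[i] for p in points) for i in range(n)],
            [max(p[i] for p in points) for i in range(n)])


def count_errors(rect: Optional[Rectangle], sample: Sequence[Tuple[Point, int]]) -> int:
    def label(x: Point) -> int:
        if rect is None:
            return 0
        a, b = rect
        return int(all(a[i] <= x[i] <= b[i] for i in range(len(a))))
    return sum(label(x) != y for x, y in sample)


def erm_rectangle_agnostic(sample: Sequence[Tuple[Point, int]],
                           n: int) -> Tuple[Optional[Rectangle], int]:
    """Try the bounding box of every subset of at most 2n sample points."""
    best_rect, best_errors = None, count_errors(None, sample)
    points = [x for x, _ in sample]
    for size in range(1, min(2 * n, len(points)) + 1):
        for subset in combinations(points, size):      # m^{O(n)} subsets in total
            rect = bounding_box(subset)
            errors = count_errors(rect, sample)
            if errors < best_errors:
                best_rect, best_errors = rect, errors
    return best_rect, best_errors


# Intervals on the line (n = 1) with labels that no interval fits perfectly.
sample = [((0.0,), 0), ((1.0,), 1), ((2.0,), 0), ((3.0,), 1), ((4.0,), 1), ((5.0,), 0)]
best_rect, best_errors = erm_rectangle_agnostic(sample, n=1)
print(best_rect, best_errors)
```

The number of subsets examined grows like $m^{2n}$, which is polynomial in $m$ only when $n$ is held fixed.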
8.2.3 Boolean Conjunctions


A Boolean conjunction is a mapping from $\mathcal{X} = \{0,1\}^n$ to $\mathcal{Y} = \{0,1\}$ that can be
expressed as a proposition formula of the form $x_{i_1} \wedge \ldots \wedge x_{i_k} \wedge \neg x_{j_1} \wedge \ldots \wedge \neg x_{j_r}$, for
some indices $i_1, \ldots, i_k, j_1, \ldots, j_r \in [n]$. The function that such a proposition formula
defines is

$$h(\mathbf{x}) = \begin{cases} 1 & \text{if } x_{i_1} = \cdots = x_{i_k} = 1 \text{ and } x_{j_1} = \cdots = x_{j_r} = 0 \\ 0 & \text{otherwise} \end{cases}$$


Let $\mathcal{H}_C^n$ be the class of all Boolean conjunctions over $\{0,1\}^n$. The size of $\mathcal{H}_C^n$ is
at most $3^n + 1$ (since in a conjunction formula, each element of $\mathbf{x}$ either appears,
or appears with a negation sign, or does not appear at all, and we also have the all
negative formula). Hence, the sample complexity of learning $\mathcal{H}_C^n$ using the ERM
rule is at most $n \log(3/\delta)/\epsilon$.


Efficiently Learnable in the Realizable Case 

Next, we show that it is possible to solve the ERM problem for $\mathcal{H}_C^n$ in time polynomial
in $n$ and $m$. The idea is to define an ERM conjunction by including in the
hypothesis conjunction all the literals that do not contradict any positively labeled
example. Let $\mathbf{v}_1, \ldots, \mathbf{v}_{m^+}$ be all the positively labeled instances in the input sample $S$.
We define, by induction on $i \le m^+$, a sequence of hypotheses (or conjunctions). Let
$h_0$ be the conjunction of all possible literals. That is, $h_0 = x_1 \wedge \neg x_1 \wedge x_2 \wedge \ldots \wedge x_n \wedge \neg x_n$.
Note that $h_0$ assigns the label 0 to all the elements of $\mathcal{X}$. We obtain $h_{i+1}$ by deleting
from the conjunction $h_i$ all the literals that are not satisfied by $\mathbf{v}_{i+1}$. The algorithm
outputs the hypothesis $h_{m^+}$. Note that $h_{m^+}$ labels positively all the positively labeled
examples in $S$. Furthermore, for every $i \le m^+$, $h_i$ is the most restrictive conjunction
that labels $\mathbf{v}_1, \ldots, \mathbf{v}_i$ positively. Now, since we consider learning in the realizable
setup, there exists a conjunction hypothesis, $f \in \mathcal{H}_C^n$, that is consistent with all the
examples in $S$. Since $h_{m^+}$ is the most restrictive conjunction that labels positively all
the positively labeled members of $S$, any instance labeled 0 by $f$ is also labeled 0
by $h_{m^+}$. It follows that $h_{m^+}$ has zero training error (w.r.t. $S$) and is therefore a legal
ERM hypothesis. Note that the running time of this algorithm is $O(mn)$.
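
The elimination procedure can be sketched as follows. The encoding of a literal as a pair `(i, b)`, meaning "coordinate $i$ must equal $b$" (so `(i, 1)` is $x_i$ and `(i, 0)` is $\neg x_i$, with zero-based indices), and the function names are assumptions made for this illustration.

```python
from typing import Sequence, Set, Tuple

Literal = Tuple[int, int]       # (i, 1) encodes x_i, (i, 0) encodes "not x_i"


def learn_conjunction(sample: Sequence[Tuple[Sequence[int], int]], n: int) -> Set[Literal]:
    """Start from all 2n literals and delete those violated by a positive example."""
    literals = {(i, b) for i in range(n) for b in (0, 1)}      # h_0: always outputs 0
    for x, y in sample:
        if y == 1:
            # Keep only the literals that the positive example x satisfies.
            literals = {(i, b) for (i, b) in literals if x[i] == b}
    return literals


def evaluate(literals: Set[Literal], x: Sequence[int]) -> int:
    return int(all(x[i] == b for i, b in literals))


# Labels are generated by the conjunction x_0 AND (NOT x_2) over {0,1}^3.
sample = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((1, 1, 1), 0)]
h = learn_conjunction(sample, n=3)
training_errors = sum(evaluate(h, x) != y for x, y in sample)
print(sorted(h), training_errors)
```

Each positive example is compared against at most $2n$ literals, which matches the $O(mn)$ running time.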


Not Efficiently Learnable in the Agnostic Case 

As in the case of axis aligned rectangles, unless P = NP, there is no algorithm whose
running time is polynomial in $m$ and $n$ that is guaranteed to find an ERM hypothesis
for the class of Boolean conjunctions in the unrealizable case.


8.2.4 Learning 3-Term DNF 


We next show that a slight generalization of the class of Boolean conjunctions leads
to intractability of solving the ERM problem even in the realizable case. Consider
the class of 3-term disjunctive normal form formulae (3-term DNF). The instance
space is $\mathcal{X} = \{0,1\}^n$ and each hypothesis is represented by the Boolean formula of
the form $h(\mathbf{x}) = A_1(\mathbf{x}) \vee A_2(\mathbf{x}) \vee A_3(\mathbf{x})$, where each $A_i(\mathbf{x})$ is a Boolean conjunction (as
defined in the previous section). The output of $h(\mathbf{x})$ is 1 if either $A_1(\mathbf{x})$ or $A_2(\mathbf{x})$ or
$A_3(\mathbf{x})$ outputs the label 1. If all three conjunctions output the label 0 then $h(\mathbf{x}) = 0$.

Let $\mathcal{H}_{3DNF}^n$ be the hypothesis class of all such 3-term DNF formulae. The size
of $\mathcal{H}_{3DNF}^n$ is at most $3^{3n}$. Hence, the sample complexity of learning $\mathcal{H}_{3DNF}^n$ using the
ERM rule is at most $3n \log(3/\delta)/\epsilon$.

However, from the computational perspective, this learning problem is hard. 
It has been shown (see (Pitt & Valiant 1988, Kearns, Schapire & Sellie 1994)) 
3-term-DNF formulae over $\{0,1\}^n$


8.4 HARDNESS OF LEARNING* 


We have just demonstrated that the computational hardness of implementing
$\mathrm{ERM}_{\mathcal{H}}$ does not imply that such a class $\mathcal{H}$ is not learnable. How can we prove that
a learning problem is computationally hard?

One approach is to rely on cryptographic assumptions. In some sense, cryptog- 
raphy is the opposite of learning. In learning we try to uncover some rule underlying 
the examples we see, whereas in cryptography, the goal is to make sure that nobody 
will be able to discover some secret, in spite of having access to some partial information
about it. In that high-level, intuitive sense, results about the cryptographic
security of some system translate into results about the unlearnability of some corresponding
task. Regrettably, currently one has no way of proving that a cryptographic
protocol is not breakable. Even the common assumption of P $\ne$ NP does not suffice
for that (although it can be shown to be necessary for most common cryptographic 
scenarios). The common approach for proving that cryptographic protocols are 
secure is to start with some cryptographic assumptions. The more these are used 
as a basis for cryptography, the stronger is our belief that they really hold (or, at 
least, that algorithms that will refute them are hard to come by). 

We now briefly describe the basic idea of how to deduce hardness of learnability
from cryptographic assumptions. Many cryptographic systems rely on the assumption
that there exists a one way function. Roughly speaking, a one way function is
a function $f : \{0,1\}^n \to \{0,1\}^n$ (more formally, it is a sequence of functions, one for
each dimension $n$) that is easy to compute but is hard to invert. More formally, $f$
can be computed in time poly$(n)$ but for any randomized polynomial time algorithm
$A$, and for every polynomial $p(\cdot)$,


$$\mathbb{P}[f(A(f(\mathbf{x}))) = f(\mathbf{x})] < \frac{1}{p(n)},$$


where the probability is taken over a random choice of $\mathbf{x}$ according to the uniform
distribution over $\{0,1\}^n$ and the randomness of $A$.

A one way function, $f$, is called a trapdoor one way function if, for some polynomial
function $p$, for every $n$ there exists a bit-string $s_n$ (called a secret key) of length
$\le p(n)$, such that there is a polynomial time algorithm that, for every $n$ and every
$\mathbf{x} \in \{0,1\}^n$, on input $(f(\mathbf{x}), s_n)$ outputs $\mathbf{x}$. In other words, although $f$ is hard to invert,
once one has access to its secret key, inverting $f$ becomes feasible. Such functions
are parameterized by their secret key.




Now, let $F_n$ be a family of trapdoor functions over $\{0,1\}^n$ that can be calculated
by some polynomial time algorithm. That is, we fix an algorithm that, given a secret
key (representing one function in $F_n$) and an input vector, calculates the value
of the function corresponding to the secret key on the input vector in polynomial
time. Consider the task of learning the class of the corresponding inverses,
$\mathcal{H}_F^n = \{f^{-1} : f \in F_n\}$. Since each function in this class can be inverted by some secret key
$s_n$ of size polynomial in $n$, the class $\mathcal{H}_F^n$ can be parameterized by these keys and its
size is at most $2^{p(n)}$. Its sample complexity is therefore polynomial in $n$. We claim
that there can be no efficient learner for this class. If there were such a learner, $L$,
then by sampling uniformly at random a polynomial number of strings in $\{0,1\}^n$,
and computing $f$ over them, we could generate a labeled training sample of pairs
$(f(\mathbf{x}), \mathbf{x})$, which should suffice for our learner to figure out an $(\epsilon, \delta)$ approximation
of $f^{-1}$ (w.r.t. the uniform distribution over the range of $f$), which would violate the
one way property of $f$.
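
The sample-generation step of this argument is simple enough to sketch. In the code below, `toy_f` is a hypothetical stand-in for $f$: it is a bijection on $n$-bit integers but is trivially invertible, so it is not a one way function and is used only to make the sketch runnable. Bit strings are encoded as integers, and all names are illustrative assumptions.

```python
import random
from typing import Callable, List, Tuple


def toy_f(x: int, n: int) -> int:
    """Stand-in for f. Multiplying by an odd constant modulo 2^n is a bijection,
    but it is easy to invert, so this is NOT a one way function."""
    return (x * 0x9E3779B1) % (1 << n)


def inversion_training_sample(f: Callable[[int, int], int], n: int, m: int,
                              rng: random.Random) -> List[Tuple[int, int]]:
    """Build m labeled pairs (f(x), x): the instance is f(x) and the label is x."""
    sample = []
    for _ in range(m):
        x = rng.getrandbits(n)          # uniform over {0,1}^n, encoded as an integer
        sample.append((f(x, n), x))
    return sample


rng = random.Random(0)
sample = inversion_training_sample(toy_f, n=16, m=5, rng=rng)
print(sample)
```

Anyone who can evaluate $f$ can produce such a sample without knowing the secret key, which is why an efficient learner for the class of inverses would yield an efficient inverter.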

A more detailed treatment, as well as a concrete example, can be found in 
(Kearns and Vazirani 1994, chapter 6). Using reductions, they also show that the 
class of functions that can be calculated by small Boolean circuits is not efficiently 
learnable, even in the realizable case. 


8.5 SUMMARY 


The runtime of learning algorithms is asymptotically analyzed as a function of dif- 
ferent parameters of the learning problem, such as the size of the hypothesis class, 
our measure of accuracy, our measure of confidence, or the size of the domain 
set. We have demonstrated cases in which the ERM rule can be implemented 
efficiently. For example, we derived efficient algorithms for solving the ERM prob- 
lem for the class of Boolean conjunctions and the class of axis aligned rectangles, 
under the realizability assumption. However, implementing ERM for these classes 
in the agnostic case is NP-hard. Recall that from the statistical perspective, there 
is no difference between the realizable and agnostic cases (i.e., a class is learn- 
able in both cases if and only if it has a finite VC-dimension). In contrast, as we 
saw, from the computational perspective the difference is immense. We have also 
shown another example, the class of 3-term DNF, where implementing ERM is 
hard even in the realizable case, yet the class is efficiently learnable by another 
algorithm. 

Hardness of implementing the ERM rule for several natural hypothesis classes 
has motivated the development of alternative learning methods, which we will 
discuss in the next part of this book. 


8.6 BIBLIOGRAPHIC REMARKS 


Valiant (1984) introduced the efficient PAC learning model in which the runtime of
the algorithm is required to be polynomial in $1/\epsilon$, $1/\delta$, and the representation size
of hypotheses in the class. A detailed discussion and thorough bibliographic notes
are given in Kearns and Vazirani (1994).


8.7 EXERCISES


8.1 Let $H$ be the class of intervals on the line (formally equivalent to axis aligned
rectangles in dimension $n = 1$). Propose an implementation of the $\mathrm{ERM}_H$ learning rule
(in the agnostic case) that given a training set of size $m$, runs in time $O(m^2)$.

Hint: Use dynamic programming.

8.2 Let $H_1, H_2, \ldots$ be a sequence of hypothesis classes for binary classification. Assume
that there is a learning algorithm that implements the ERM rule in the realizable
case such that the output hypothesis of the algorithm for each class $H_n$ only depends
on $O(n)$ examples out of the training set. Furthermore, assume that such a hypothesis
can be calculated given these $O(n)$ examples in time $O(n)$, and that the empirical
risk of each such hypothesis can be evaluated in time $O(mn)$. For example, if $H_n$ is
the class of axis aligned rectangles in $\mathbb{R}^n$, we saw that it is possible to find an ERM
hypothesis in the realizable case that is defined by at most $2n$ examples. Prove that
in such cases, it is possible to find an ERM hypothesis for $H_n$ in the unrealizable case
in time $O(mn\, m^{O(n)})$.

8.3 In this exercise, we present several classes for which finding an ERM classifier is
computationally hard. First, we introduce the class of $n$-dimensional halfspaces,
$HS_n$, for a domain $\mathcal{X} = \mathbb{R}^n$. This is the class of all functions of the form
$h_{w,b}(x) = \mathrm{sign}(\langle w, x\rangle + b)$ where $w, x \in \mathbb{R}^n$, $\langle w, x\rangle$ is their inner product,
and $b \in \mathbb{R}$. See a detailed description in Chapter 9.

1. Show that $\mathrm{ERM}_H$ over the class $H = HS_n$ of linear predictors is computationally
hard. More precisely, we consider the sequence of problems in which the dimension
$n$ grows linearly and the number of examples $m$ is set to be some constant
times $n$.

Hint: You can prove the hardness by a reduction from the following problem:


Max FS: Given a system of linear inequalities, $Ax > b$ with $A \in \mathbb{R}^{m \times n}$ and
$b \in \mathbb{R}^m$ (that is, a system of $m$ linear inequalities in $n$ variables, $x = (x_1, \ldots, x_n)$),
find a subsystem containing as many inequalities as possible that has a
solution (such a subsystem is called feasible).


It has been shown (Sankaran 1993) that the problem Max FS is NP-hard.

Show that any algorithm that finds an $\mathrm{ERM}_{HS_n}$ hypothesis for any training sample
$S \in (\mathbb{R}^n \times \{+1, -1\})^m$ can be used to solve the Max FS problem of size $m, n$.
Hint: Define a mapping that transforms linear inequalities in $n$ variables into
labeled points in $\mathbb{R}^n$, and a mapping that transforms vectors in $\mathbb{R}^n$ to halfspaces,
such that a vector $w$ satisfies an inequality $q$ if and only if the labeled point
that corresponds to $q$ is classified correctly by the halfspace corresponding to
$w$. Conclude that the problem of empirical risk minimization for halfspaces is
also NP-hard (that is, if it can be solved in time polynomial in the sample size,
$m$, and the Euclidean dimension, $n$, then every problem in the class NP can be
solved in polynomial time).

2. Let $\mathcal{X} = \mathbb{R}^n$ and let $H_k^n$ be the class of all intersections of $k$-many linear halfspaces
in $\mathbb{R}^n$. In this exercise, we wish to show that $\mathrm{ERM}_{H_k^n}$ is computationally hard for
every $k \ge 3$. Precisely, we consider a sequence of problems where $k \ge 3$ is a
constant and $n$ grows linearly. The training set size, $m$, also grows linearly with $n$.
Toward this goal, consider the $k$-coloring problem for graphs, defined as
follows:


Given a graph $G = (V, E)$, and a number $k$, determine whether there exists a
function $f : V \to \{1, \ldots, k\}$ so that for every $(u, v) \in E$, $f(u) \ne f(v)$.


The $k$-coloring problem is known to be NP-hard for every $k \ge 3$ (Karp 1972).
We wish to reduce the $k$-coloring problem to $\mathrm{ERM}_{H_k^n}$: that is, to prove that if
there is an algorithm that solves the $\mathrm{ERM}_{H_k^n}$ problem in time polynomial in $k$, $n$,
and the sample size $m$, then there is a polynomial time algorithm for the graph
$k$-coloring problem.

Given a graph $G = (V, E)$, let $\{v_1, \ldots, v_n\}$ be the vertices in $V$. Construct a sample
$S(G) \in (\mathbb{R}^n \times \{\pm 1\})^m$, where $m = |V| + |E|$, as follows:

- For every $v_i \in V$, construct an instance $e_i$ with a negative label.
- For every edge $(v_i, v_j) \in E$, construct an instance $(e_i + e_j)/2$ with a positive
  label.

   1. Prove that if there exists some $h \in H_k^n$ that has zero error over $S(G)$ then $G$ is
      $k$-colorable.
      Hint: Let $h = \bigcap_{j=1}^{k} h_j$ be an ERM classifier in $H_k^n$ over $S$. Define a
      coloring of $V$ by setting $f(v_i)$ to be the minimal $j$ such that $h_j(e_i) = -1$.
      Use the fact that halfspaces are convex sets to show that it cannot
      be true that two vertices that are connected by an edge have the same
      color.

   2. Prove that if $G$ is $k$-colorable then there exists some $h \in H_k^n$ that has zero error
      over $S(G)$.
      Hint: Given a coloring $f$ of the vertices of $G$, we should come up
      with $k$ hyperplanes, $h_1, \ldots, h_k$, whose intersection is a perfect classifier for
      $S(G)$. Let $b = 0.6$ for all of these hyperplanes and, for $t \le k$, let the
      $i$'th weight of the $t$'th hyperplane, $w_{t,i}$, be $-1$ if $f(v_i) = t$ and $0$
      otherwise.

   3. On the basis of the preceding, prove that for any $k \ge 3$, the $\mathrm{ERM}_{H_k^n}$ problem
      is NP-hard.
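
As an illustration of the construction (not a proof), the following Python sketch builds $S(G)$ for a small graph and checks the halfspaces suggested in the hint of item 2. The function names, the 0-based color indices, and the example graph are assumptions made for this sketch only.

```python
def unit(i, n):
    """Standard basis vector e_i in R^n (0-based index)."""
    return [1.0 if j == i else 0.0 for j in range(n)]

def build_sample(n, edges):
    """S(G): e_i labeled -1 for each vertex, (e_i + e_j)/2 labeled +1 for each edge."""
    sample = [(unit(i, n), -1) for i in range(n)]
    for i, j in edges:
        midpoint = [(a + b) / 2 for a, b in zip(unit(i, n), unit(j, n))]
        sample.append((midpoint, +1))
    return sample

def halfspaces_from_coloring(coloring, k, bias=0.6):
    """One halfspace per color t: weight -1 on vertices colored t, 0 elsewhere."""
    return [([-1.0 if c == t else 0.0 for c in coloring], bias) for t in range(k)]

def predict(halfspaces, x):
    """Intersection of halfspaces: +1 only if x is on the positive side of all of them."""
    for w, b in halfspaces:
        if sum(wi * xi for wi, xi in zip(w, x)) + b <= 0:
            return -1
    return +1

def has_zero_error(halfspaces, sample):
    return all(predict(halfspaces, x) == y for x, y in sample)

# Hypothetical example graph: triangle 0-1-2 plus a pendant vertex 3 attached to 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
sample = build_sample(4, edges)

proper = [0, 1, 2, 0]      # a valid 3-coloring (colors are 0-based here)
improper = [0, 0, 1, 2]    # vertices 0 and 1 are adjacent but share a color

print(has_zero_error(halfspaces_from_coloring(proper, 3), sample))
print(has_zero_error(halfspaces_from_coloring(improper, 3), sample))
```

With the proper coloring every vertex point falls on the negative side of the halfspace of its own color, while every edge midpoint has inner product at least $-0.5$ with each weight vector and so stays positive. With the improper coloring the midpoint of the monochromatic edge is misclassified.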


8.4 In this exercise we show that hardness of solving the ERM problem is equivalent to
hardness of proper PAC learning. Recall that by “properness” of the algorithm we
mean that it must output a hypothesis from the hypothesis class. To formalize this
statement, we first need the following definition.


Definition 8.2. The complexity class Randomized Polynomial (RP) time is the class
of all decision problems (that is, problems in which on any instance one has to find
out whether the answer is YES or NO) for which there exists a probabilistic algorithm
(namely, the algorithm is allowed to flip random coins while it is running) with
these properties:

- On any input instance the algorithm runs in polynomial time in the input size.
- If the correct answer is NO, the algorithm must return NO.
- If the correct answer is YES, the algorithm returns YES with probability $a \ge 1/2$
  and returns NO with probability $1 - a$.¹


Clearly the class RP contains the class P. It is also known that RP is contained in the
class NP. It is not known whether any equality holds among these three complexity
classes, but it is widely believed that NP is strictly larger than RP. In particular, it is
believed that NP-hard problems cannot be solved by a randomized polynomial time
algorithm.
Show that if a class $H$ is properly PAC learnable by a polynomial time algorithm,
then the $\mathrm{ERM}_H$ problem is in the class RP. In particular, this implies that whenever
the $\mathrm{ERM}_H$ problem is NP-hard (for example, the class of intersections of
halfspaces discussed in the previous exercise), then, unless NP = RP, there exists
no polynomial time proper PAC learning algorithm for $H$.

Hint: Assume you have an algorithm $A$ that properly PAC learns a class $H$ in
time polynomial in some class parameter $n$ as well as in $1/\epsilon$ and $1/\delta$. Your
goal is to use that algorithm as a subroutine to construct an algorithm $B$ for
solving the $\mathrm{ERM}_H$ problem in random polynomial time. Given a training set,
$S \in (\mathcal{X} \times \{\pm 1\})^m$, and some $h \in H$ whose error on $S$ is zero, apply the PAC
learning algorithm to the uniform distribution over $S$ and run it so that with
probability $\ge 0.3$ it finds a function $h \in H$ that has error less than $\epsilon = 1/|S|$ (with
respect to that uniform distribution). Show that the algorithm just described
satisfies the requirements for being an RP solver for $\mathrm{ERM}_H$.

¹ The constant 1/2 in the definition can be replaced by any constant in (0, 1).


In this chapter we will study the family of linear predictors, one of the most useful
families of hypothesis classes. Many learning algorithms that are being widely used 
in practice rely on linear predictors, first and foremost because of the ability to learn 
them efficiently in many cases. In addition, linear predictors are intuitive, are easy 
to interpret, and fit the data reasonably well in many natural learning problems. 

We will introduce several hypothesis classes belonging to this family (halfspaces,
linear regression predictors, and logistic regression predictors) and
present relevant learning algorithms: linear programming and the Perceptron
algorithm for the class of halfspaces and the Least Squares algorithm for linear 
regression. This chapter is focused on learning linear predictors using the ERM 
approach; however, in later chapters we will see alternative paradigms for learning 
these hypothesis classes. 

First, we define the class of affine functions as

$L_d = \{h_{w,b} : w \in \mathbb{R}^d, b \in \mathbb{R}\}$,

where

$h_{w,b}(x) = \langle w, x\rangle + b = \left(\sum_{i=1}^{d} w_i x_i\right) + b$.

It will be convenient also to use the notation

$L_d = \{x \mapsto \langle w, x\rangle + b : w \in \mathbb{R}^d, b \in \mathbb{R}\}$,

which reads as follows: $L_d$ is a set of functions, where each function is parameterized
by $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$, and each such function takes as input a vector $x$ and returns as
output the scalar $\langle w, x\rangle + b$.

The different hypothesis classes of linear predictors are compositions of a function
$\phi : \mathbb{R} \to \mathcal{Y}$ on $L_d$. For example, in binary classification, we can choose $\phi$ to be
the sign function, and for regression problems, where $\mathcal{Y} = \mathbb{R}$, $\phi$ is simply the identity
function.

It may be more convenient to incorporate $b$, called the bias, into $w$ as an extra
coordinate and add an extra coordinate with a value of 1 to all $x \in \mathcal{X}$; namely,
let $w' = (b, w_1, w_2, \ldots, w_d) \in \mathbb{R}^{d+1}$ and let $x' = (1, x_1, x_2, \ldots, x_d) \in \mathbb{R}^{d+1}$. Therefore,

$h_{w,b}(x) = \langle w, x\rangle + b = \langle w', x'\rangle$.

It follows that each affine function in $\mathbb{R}^d$ can be rewritten as a homogenous linear
function in $\mathbb{R}^{d+1}$ applied over the transformation that appends the constant 1 to
each input vector. Therefore, whenever it simplifies the presentation, we will omit
the bias term and refer to $L_d$ as the class of homogenous linear functions of the form
$h_w(x) = \langle w, x\rangle$.
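
The reduction from affine to homogenous functions can be checked numerically. The following is a minimal Python sketch; the helper names `absorb_bias` and `augment` and the example numbers are assumptions introduced here for illustration.

```python
def inner(u, v):
    """Inner product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def affine(w, b, x):
    """h_{w,b}(x) = <w, x> + b."""
    return inner(w, x) + b

def absorb_bias(w, b):
    """w' = (b, w_1, ..., w_d): the bias becomes an extra first coordinate."""
    return [b] + list(w)

def augment(x):
    """x' = (1, x_1, ..., x_d): prepend the constant 1 to the input."""
    return [1.0] + list(x)

w, b = [2.0, -1.0, 0.5], 0.25
x = [1.0, 3.0, -2.0]

lhs = affine(w, b, x)                        # <w, x> + b
rhs = inner(absorb_bias(w, b), augment(x))   # <w', x'>
print(lhs, rhs)
```

Both expressions evaluate the same predictor, which is why the bias term can be dropped from the presentation without loss of generality.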

Throughout the book we often use the general term “linear functions” for both 
affine functions and (homogenous) linear functions. 


9.1 HALFSPACES 


The first hypothesis class we consider is the class of halfspaces, designed for binary
classification problems, namely, $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{-1, +1\}$. The class of halfspaces is
defined as follows:

$HS_d = \mathrm{sign} \circ L_d = \{x \mapsto \mathrm{sign}(h_{w,b}(x)) : h_{w,b} \in L_d\}$.

In other words, each halfspace hypothesis in $HS_d$ is parameterized by $w \in \mathbb{R}^d$ and
$b \in \mathbb{R}$ and upon receiving a vector $x$ the hypothesis returns the label $\mathrm{sign}(\langle w, x\rangle + b)$.

To illustrate this hypothesis class geometrically, it is instructive to consider the
case $d = 2$. Each hypothesis forms a hyperplane that is perpendicular to the vector
$w$ and intersects the vertical axis at the point $(0, -b/w_2)$. The instances that are
“above” the hyperplane, that is, share an acute angle with $w$, are labeled positively.
Instances that are “below” the hyperplane, that is, share an obtuse angle with $w$, are
labeled negatively.


In Section 9.1.3 we will show that $\mathrm{VCdim}(HS_d) = d + 1$. It follows that we
can learn halfspaces using the ERM paradigm, as long as the sample size is
$\Omega\left(\frac{d + \log(1/\delta)}{\epsilon}\right)$. Therefore, we now discuss how to implement an ERM procedure
for halfspaces.

In the following we introduce two solutions to finding an ERM halfspace in the
realizable case. In the context of halfspaces, the realizable case is often referred to
as the “separable” case, since it is possible to separate with a hyperplane all the
positive examples from all the negative examples. Implementing the ERM rule in
the nonseparable case (i.e., the agnostic case) is known to be computationally hard
(Ben-David and Simon, 2001). There are several approaches to learning nonseparable
data. The most popular one is to use surrogate loss functions, namely, to learn
a halfspace that does not necessarily minimize the empirical risk with the 0-1 loss,
but rather with respect to a different loss function. For example, in Section 9.3 we
will describe the logistic regression approach, which can be implemented efficiently
even in the nonseparable case. We will study surrogate loss functions in more detail
later on in Chapter 12.


9.1.1 Linear Programming for the Class of Halfspaces 


Linear programs (LP) are problems that can be expressed as maximizing a linear
function subject to linear inequalities. That is,

$\max_{w \in \mathbb{R}^d} \langle u, w\rangle$
subject to $Aw \ge v$

where $w \in \mathbb{R}^d$ is the vector of variables we wish to determine, $A$ is an $m \times d$ matrix,
and $v \in \mathbb{R}^m$, $u \in \mathbb{R}^d$ are vectors. Linear programs can be solved efficiently,¹ and
furthermore, there are publicly available implementations of LP solvers.

We will show that the ERM problem for halfspaces in the realizable case can be
expressed as a linear program. For simplicity, we assume the homogenous case. Let
$S = \{(x_i, y_i)\}_{i=1}^{m}$ be a training set of size $m$. Since we assume the realizable case, an
ERM predictor should have zero errors on the training set. That is, we are looking
for some vector $w \in \mathbb{R}^d$ for which

$\mathrm{sign}(\langle w, x_i\rangle) = y_i, \quad \forall i = 1, \ldots, m$.

Equivalently, we are looking for some vector $w$ for which

$y_i \langle w, x_i\rangle > 0, \quad \forall i = 1, \ldots, m$.

Let $w^*$ be a vector that satisfies this condition (it must exist since we assume
realizability). Define $\gamma = \min_i (y_i \langle w^*, x_i\rangle)$ and let $\bar{w} = w^*/\gamma$. Therefore, for all $i$ we
have

$y_i \langle \bar{w}, x_i\rangle = \frac{1}{\gamma}\, y_i \langle w^*, x_i\rangle \ge 1$.

We have thus shown that there exists a vector that satisfies

$y_i \langle w, x_i\rangle \ge 1, \quad \forall i = 1, \ldots, m$.    (9.1)

And clearly, such a vector is an ERM predictor.

To find a vector that satisfies Equation (9.1) we can rely on an LP solver as
follows. Set $A$ to be the $m \times d$ matrix whose rows are the instances multiplied by $y_i$.
That is, $A_{i,j} = y_i x_{i,j}$, where $x_{i,j}$ is the $j$'th element of the vector $x_i$. Let $v$ be the
vector $(1, \ldots, 1) \in \mathbb{R}^m$. Then, Equation (9.1) can be rewritten as

$Aw \ge v$.

The LP form requires a maximization objective, yet all the $w$ that satisfy the
constraints are equal candidates as output hypotheses. Thus, we set a “dummy”
objective, $u = (0, \ldots, 0) \in \mathbb{R}^d$.
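
The construction of the LP data can be sketched in a few lines of Python. This sketch does not call an LP solver; it only builds $u$, $A$, $v$ and checks the rescaling argument on a toy separable sample. The function names, the tolerance `tol`, and the example data are assumptions made for illustration.

```python
def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def lp_data(sample):
    """Return (u, A, v) for the program: max <u, w> subject to A w >= v."""
    d = len(sample[0][0])
    A = [[y * xj for xj in x] for x, y in sample]   # A[i][j] = y_i * x_{i,j}
    v = [1.0] * len(sample)                         # right-hand side of Equation (9.1)
    u = [0.0] * d                                   # the "dummy" objective
    return u, A, v

def is_feasible(A, v, w, tol=1e-12):
    """Check A w >= v row by row, up to a small floating-point tolerance."""
    return all(inner(row, w) >= vi - tol for row, vi in zip(A, v))

def rescale(w_star, sample):
    """Given w* with y_i <w*, x_i> > 0 for all i, return w* / gamma."""
    gamma = min(y * inner(w_star, x) for x, y in sample)
    if gamma <= 0:
        raise ValueError("w_star does not separate the sample")
    return [wi / gamma for wi in w_star]

sample = [([2.0, 1.0], 1), ([1.0, 3.0], 1), ([-1.0, -2.0], -1), ([-3.0, 0.5], -1)]
u, A, v = lp_data(sample)

w_star = [0.25, 0.125]            # separates the sample, but with margins below 1
w_bar = rescale(w_star, sample)   # margins are now at least 1

print(is_feasible(A, v, w_star), is_feasible(A, v, w_bar))
```

In practice the triple `(u, A, v)` would be handed to any available LP solver, and any feasible point it returns is an ERM halfspace for the sample.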


¹ Namely, in time polynomial in $m$, $d$, and in the representation size of real numbers.


9.1.2 Perceptron for Halfspaces


A different implementation of the ERM rule is the Perceptron algorithm of Rosenblatt
(Rosenblatt 1958). The Perceptron is an iterative algorithm that constructs a
sequence of vectors $w^{(1)}, w^{(2)}, \ldots$. Initially, $w^{(1)}$ is set to be the all-zeros vector. At
iteration $t$, the Perceptron finds an example $i$ that is mislabeled by $w^{(t)}$, namely, an
example for which $\mathrm{sign}(\langle w^{(t)}, x_i\rangle) \ne y_i$. Then, the Perceptron updates $w^{(t)}$ by adding
to it the instance $x_i$ scaled by the label $y_i$. That is, $w^{(t+1)} = w^{(t)} + y_i x_i$. Recall that
our goal is to have $y_i \langle w, x_i\rangle > 0$ for all $i$ and note that

$y_i \langle w^{(t+1)}, x_i\rangle = y_i \langle w^{(t)} + y_i x_i, x_i\rangle = y_i \langle w^{(t)}, x_i\rangle + \|x_i\|^2$.

Hence, the update of the Perceptron guides the solution to be “more correct” on
the $i$'th example.
Batch Perceptron

input: A training set $(x_1, y_1), \ldots, (x_m, y_m)$
initialize: $w^{(1)} = (0, \ldots, 0)$
for $t = 1, 2, \ldots$
    if ($\exists\, i$ s.t. $y_i \langle w^{(t)}, x_i\rangle \le 0$) then
        $w^{(t+1)} = w^{(t)} + y_i x_i$
    else
        output $w^{(t)}$
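
The pseudocode translates almost line by line into Python. The sketch below is one possible rendering; the `max_iters` safeguard and the returned update count are additions made here (they are not part of the pseudocode), and the example data are hypothetical.

```python
def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def batch_perceptron(sample, max_iters=10_000):
    """Return (w, number_of_updates) for a list of (x, y) pairs with y in {-1, +1}.

    max_iters is a safeguard added in this sketch: on non-separable data the
    loop in the pseudocode would never terminate.
    """
    w = [0.0] * len(sample[0][0])
    for t in range(max_iters):
        # Find the first example with y_i <w, x_i> <= 0, if any.
        mistake = next(((x, y) for x, y in sample if y * inner(w, x) <= 0), None)
        if mistake is None:
            return w, t                    # t updates were performed before stopping
        x, y = mistake
        w = [wi + y * xi for wi, xi in zip(w, x)]
    raise RuntimeError("no separator found within max_iters iterations")

sample = [([1.0, 2.0], 1), ([2.0, -1.0], 1), ([-1.0, -1.0], -1)]
w, updates = batch_perceptron(sample)
print(w, updates)
```

Note the non-strict test `<= 0`: since the initial vector is all zeros, a strict inequality would make the algorithm stop immediately without performing any update.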


The following theorem guarantees that in the realizable case, the algorithm stops 
with all sample points correctly classified. 


Theorem 9.1. Assume that $(x_1, y_1), \ldots, (x_m, y_m)$ is separable, let
$B = \min\{\|w\| : \forall i \in [m],\ y_i \langle w, x_i\rangle \ge 1\}$, and let $R = \max_i \|x_i\|$. Then, the Perceptron
algorithm stops after at most $(RB)^2$ iterations, and when it stops it holds that
$\forall i \in [m],\ y_i \langle w^{(t)}, x_i\rangle > 0$.


Proof. By the definition of the stopping condition, if the Perceptron stops it must
have separated all the examples. We will show that if the Perceptron runs for $T$
iterations, then we must have $T \le (RB)^2$, which implies the Perceptron must stop
after at most $(RB)^2$ iterations.

Let $w^*$ be a vector that achieves the minimum in the definition of $B$. That is,
$y_i \langle w^*, x_i\rangle \ge 1$ for all $i$, and among all vectors that satisfy these constraints, $w^*$ is of
minimal norm.

The idea of the proof is to show that after performing $T$ iterations, the cosine of
the angle between $w^*$ and $w^{(T+1)}$ is at least $\frac{\sqrt{T}}{RB}$. That is, we will show that

$\frac{\langle w^*, w^{(T+1)}\rangle}{\|w^*\|\, \|w^{(T+1)}\|} \ge \frac{\sqrt{T}}{RB}$.    (9.2)

By the Cauchy-Schwarz inequality, the left-hand side of Equation (9.2) is at most
1. Therefore, Equation (9.2) would imply that

$1 \ge \frac{\sqrt{T}}{RB} \ \Rightarrow\ T \le (RB)^2$,

which will conclude our proof.

To show that Equation (9.2) holds, we first show that $\langle w^*, w^{(T+1)}\rangle \ge T$. Indeed,
at the first iteration, $w^{(1)} = (0, \ldots, 0)$ and therefore $\langle w^*, w^{(1)}\rangle = 0$, while on iteration
$t$, if we update using example $(x_i, y_i)$ we have that

$\langle w^*, w^{(t+1)}\rangle - \langle w^*, w^{(t)}\rangle = \langle w^*, w^{(t+1)} - w^{(t)}\rangle = \langle w^*, y_i x_i\rangle = y_i \langle w^*, x_i\rangle \ge 1$.

Therefore, after performing $T$ iterations, we get

$\langle w^*, w^{(T+1)}\rangle = \sum_{t=1}^{T} \left(\langle w^*, w^{(t+1)}\rangle - \langle w^*, w^{(t)}\rangle\right) \ge T$,    (9.3)

as required.
Next, we upper bound $\|w^{(T+1)}\|$. For each iteration $t$ we have that

$\|w^{(t+1)}\|^2 = \|w^{(t)} + y_i x_i\|^2 = \|w^{(t)}\|^2 + 2 y_i \langle w^{(t)}, x_i\rangle + y_i^2 \|x_i\|^2 \le \|w^{(t)}\|^2 + R^2$    (9.4)

where the last inequality is due to the fact that example $i$ is necessarily such that
$y_i \langle w^{(t)}, x_i\rangle \le 0$, and the norm of $x_i$ is at most $R$. Now, since $\|w^{(1)}\|^2 = 0$, if we use
Equation (9.4) recursively for $T$ iterations, we obtain that

$\|w^{(T+1)}\|^2 \le T R^2 \ \Rightarrow\ \|w^{(T+1)}\| \le \sqrt{T}\, R$.    (9.5)

Combining Equation (9.3) with Equation (9.5), and using the fact that $\|w^*\| = B$, we
obtain that

$\frac{\langle w^{(T+1)}, w^*\rangle}{\|w^*\|\, \|w^{(T+1)}\|} \ge \frac{T}{B \sqrt{T}\, R} = \frac{\sqrt{T}}{BR}$.

We have thus shown that Equation (9.2) holds, and this concludes our proof. □

Remark 9.1. The Perceptron is simple to implement and is guaranteed to converge.
However, the convergence rate depends on the parameter $B$, which in some situations
might be exponentially large in $d$. In such cases, it would be better to
implement the ERM rule by solving a linear program, as described in the previous
section. Nevertheless, for many natural data sets, the size of $B$ is not too large,
and the Perceptron converges quite fast.
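
The two inequalities that drive the proof, Equations (9.3) and (9.5), can be observed on a concrete run. The Python sketch below records $\langle w_{\mathrm{ref}}, w^{(t+1)}\rangle$ and $\|w^{(t+1)}\|^2$ after every update. Here `w_ref` is an assumed reference vector that merely satisfies the margin constraints; it need not be the minimal-norm $w^*$, so $\|w_{\mathrm{ref}}\|$ only upper bounds $B$. All names and data are hypothetical.

```python
def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron_trace(sample, w_ref):
    """Run the Perceptron; after each update record (<w_ref, w>, ||w||^2)."""
    if min(y * inner(w_ref, x) for x, y in sample) < 1:
        raise ValueError("w_ref must satisfy y_i <w_ref, x_i> >= 1 for all i")
    w = [0.0] * len(w_ref)
    trace = []
    while True:   # terminates by Theorem 9.1, since a separator with margin 1 exists
        mistake = next(((x, y) for x, y in sample if y * inner(w, x) <= 0), None)
        if mistake is None:
            return w, trace
        x, y = mistake
        w = [wi + y * xi for wi, xi in zip(w, x)]
        trace.append((inner(w_ref, w), inner(w, w)))

sample = [([1.0, 2.0], 1), ([2.0, -1.0], 1), ([-1.0, -1.0], -1)]
w_ref = [0.75, 0.25]
R2 = max(inner(x, x) for x, _ in sample)   # R^2

w, trace = perceptron_trace(sample, w_ref)
for t, (dot, sq_norm) in enumerate(trace, start=1):
    assert dot >= t              # Equation (9.3) after t updates
    assert sq_norm <= t * R2     # Equation (9.5) after t updates

# Number of updates versus the bound (R * ||w_ref||)^2 >= (R * B)^2.
print(len(trace), R2 * inner(w_ref, w_ref))
```

The inner product with the reference vector grows at least linearly in the number of updates while the norm grows at most like its square root, which is exactly the tension that forces the algorithm to stop.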


9.1.3 The VC Dimension of Halfspaces 


To compute the VC dimension of halfspaces, we start with the homogenous case.

Theorem 9.2. The VC dimension of the class of homogenous halfspaces in $\mathbb{R}^d$ is $d$.


Proof. First, consider the set of vectors $e_1, \ldots, e_d$, where for every $i$ the vector $e_i$ is
the all zeros vector except 1 in the $i$'th coordinate. This set is shattered by the class
of homogenous halfspaces. Indeed, for every labeling $y_1, \ldots, y_d$, set $w = (y_1, \ldots, y_d)$,
and then $\langle w, e_i\rangle = y_i$ for all $i$.

Next, let $x_1, \ldots, x_{d+1}$ be a set of $d+1$ vectors in $\mathbb{R}^d$. Then, there must exist
real numbers $a_1, \ldots, a_{d+1}$, not all of them zero, such that $\sum_{i=1}^{d+1} a_i x_i = 0$. Let
$I = \{i : a_i > 0\}$ and $J = \{j : a_j < 0\}$. Either $I$ or $J$ is nonempty. Let us first assume
that both of them are nonempty. Then,

$\sum_{i \in I} a_i x_i = \sum_{j \in J} |a_j| x_j$.

Now, suppose that $x_1, \ldots, x_{d+1}$ are shattered by the class of homogenous halfspaces.
Then, there must exist a vector $w$ such that $\langle w, x_i\rangle > 0$ for all $i \in I$ while $\langle w, x_j\rangle < 0$
for every $j \in J$. It follows that

$0 < \sum_{i \in I} a_i \langle x_i, w\rangle = \left\langle \sum_{i \in I} a_i x_i, w \right\rangle = \left\langle \sum_{j \in J} |a_j| x_j, w \right\rangle = \sum_{j \in J} |a_j| \langle x_j, w\rangle < 0$,

which leads to a contradiction. Finally, if $J$ (respectively, $I$) is empty then the
right-most (respectively, left-most) inequality should be replaced by an equality, which
still leads to a contradiction. □
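
The first half of the proof (the lower bound) is constructive and easy to check by brute force for small $d$. The following Python sketch enumerates all $2^d$ labelings of $e_1, \ldots, e_d$ and verifies that the choice $w = (y_1, \ldots, y_d)$ realizes each of them. The function names are assumptions made for this illustration; the upper bound is not checked by this code.

```python
from itertools import product

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def sign(z):
    return 1 if z > 0 else -1

def standard_basis(d):
    return [[1 if j == i else 0 for j in range(d)] for i in range(d)]

def shatters_standard_basis(d):
    """Check that homogenous halfspaces realize every labeling of e_1, ..., e_d."""
    basis = standard_basis(d)
    for labels in product((-1, 1), repeat=d):
        w = list(labels)   # the choice w = (y_1, ..., y_d) from the proof
        if any(sign(inner(w, e)) != y for e, y in zip(basis, labels)):
            return False
    return True

print(shatters_standard_basis(3))
```

The loop visits exponentially many labelings, so this is only meant as a sanity check for small dimensions.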


Theorem 9.3. The VC dimension of the class of nonhomogenous halfspaces in $\mathbb{R}^d$ is
$d+1$.


Proof. First, as in the proof of Theorem 9.2, it is easy to verify that the set of vectors
$0, e_1, \ldots, e_d$ is shattered by the class of nonhomogenous halfspaces. Second, suppose
that the vectors $x_1, \ldots, x_{d+2}$ are shattered by the class of nonhomogenous halfspaces.
Then, using the reduction we have shown in the beginning of this chapter, it follows
that there are $d+2$ vectors in $\mathbb{R}^{d+1}$ that are shattered by the class of homogenous
halfspaces, which contradicts Theorem 9.2. □


9.2 LINEAR REGRESSION 


Linear regression is a common statistical tool for modeling the relationship between
some “explanatory” variables and some real valued outcome. Cast as a learning
problem, the domain set $\mathcal{X}$ is a subset of $\mathbb{R}^d$, for some $d$, and the label set $\mathcal{Y}$ is the
set of real numbers. We would like to learn a linear function $h : \mathbb{R}^d \to \mathbb{R}$ that best
approximates the relationship between our variables (say, for example, predicting
the weight of a baby as a function of her age and weight at birth). Figure 9.1 shows
an example of a linear regression predictor for $d = 1$.

The hypothesis class of linear regression predictors is simply the set of linear
functions,

$H_{reg} = L_d = \{x \mapsto \langle w, x\rangle + b : w \in \mathbb{R}^d, b \in \mathbb{R}\}$.


Next we need to define a loss function for regression. While in classification
the definition of the loss is straightforward, as $\ell(h, (x, y))$ simply indicates whether
$h(x)$ correctly predicts $y$ or not, in regression, if the baby’s weight is 3 kg, both the
predictions 3.00001 kg and 4 kg are “wrong,” but we would clearly prefer the former
over the latter. We therefore need to define how much we shall be “penalized” for
the discrepancy between $h(x)$ and $y$. [...]

[...] The ERM problem with respect to this class, given a training set $S$, and using the
homogenous version of $L_d$, is to find

$\operatorname{argmin}_{w} L_S(h_w) = \operatorname{argmin}_{w} \frac{1}{m} \sum_{i=1}^{m} (\langle w, x_i\rangle - y_i)^2$.


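To solve the problem we calculate the gradient of the objective function and
set it to zero. That is, we need to solve

$\frac{2}{m} \sum_{i=1}^{m} (\langle w, x_i\rangle - y_i)\, x_i = 0$.

We can rewrite the problem as the problem $Aw = b$ where

$A = \left(\sum_{i=1}^{m} x_i x_i^\top\right)$ and $b = \sum_{i=1}^{m} y_i x_i$.    (9.6)

Or, in matrix form:

$A = \begin{pmatrix} \vdots & & \vdots \\ x_1 & \cdots & x_m \\ \vdots & & \vdots \end{pmatrix} \begin{pmatrix} \vdots & & \vdots \\ x_1 & \cdots & x_m \\ \vdots & & \vdots \end{pmatrix}^{\top}$,    (9.7)

$b = \begin{pmatrix} \vdots & & \vdots \\ x_1 & \cdots & x_m \\ \vdots & & \vdots \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix}$.    (9.8)

If $A$ is invertible then the solution to the ERM problem is

$w = A^{-1} b$.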
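
For the invertible case, the computation can be sketched in plain Python: build $A$ and $b$ as in Equation (9.6) and solve $Aw = b$. The Gaussian elimination routine, the singularity tolerance, and the toy data (points on the line $y = 1 + 2z$, with instances $x = (1, z)$ so that the bias is absorbed) are assumptions of this sketch rather than part of the text.

```python
def normal_equations(sample):
    """Return A = sum_i x_i x_i^T and b = sum_i y_i x_i, as in Equation (9.6)."""
    d = len(sample[0][0])
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for x, y in sample:
        for r in range(d):
            b[r] += y * x[r]
            for c in range(d):
                A[r][c] += x[r] * x[c]
    return A, b

def solve(A, b, tol=1e-12):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]   # augmented copy [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < tol:
            raise ValueError("A is (numerically) singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):                      # back substitution
        s = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - s) / M[r][r]
    return w

# Instances x = (1, z) with labels y = 1 + 2 z.
sample = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]
A, b = normal_equations(sample)
w = solve(A, b)
print(w)   # close to [1.0, 2.0], i.e. intercept 1 and slope 2
```

When $A$ is singular the routine raises an error, which corresponds to the non-invertible case that requires the additional tools discussed in the text.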


The case in which $A$ is not invertible requires a few standard tools from linear
algebra, which are available in Appendix C. It can be easily shown that if the training
instances do not span the entire space of $\mathbb{R}^d$ then $A$ is not invertible. Nevertheless,
we can always find a solution to the system $Aw = b$ because $b$ is in the range of $A$.
Indeed, since $A$ is symmetric we can write it using its eigenvalue decomposition as
$A = V D V^\top$, where $D$ is a diagonal matrix and $V$ is an orthonormal matrix (that is,
$V^\top V$ is the identity $d \times d$ matrix). Define $D^+$ to be the diagonal matrix such that
$D^+_{i,i} = 0$ if $D_{i,i} = 0$ and otherwise $D^+_{i,i} = 1/D_{i,i}$. Now, define

$A^+ = V D^+ V^\top$ and $\hat{w} = A^+ b$.

Let $v_i$ denote the $i$'th column of $V$. Then, we have

$A\hat{w} = A A^+ b = V D V^\top V D^+ V^\top b = V D D^+ V^\top b = \sum_{i : D_{i,i} \ne 0} v_i v_i^\top b$.

That is, $A\hat{w}$ is the projection of $b$ onto the span of those vectors $v_i$ for which $D_{i,i} \ne 0$.
Since the linear span of $x_1, \ldots, x_m$ is the same as the linear span of those $v_i$, and $b$ is
in the linear span of the $x_i$, we obtain that $A\hat{w} = b$, which concludes our argument.


9.2.2 Linear Regression for Polynomial Regression Tasks
Some learning tasks call for nonlinear predictors, such as polynomial predictors.
Take, for instance, a one dimensional polynomial function of degree $n$, that is,

$p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$

where $(a_0, \ldots, a_n)$ is a vector of coefficients of size $n + 1$. In the following we depict
a training set that is better fitted using a 3rd degree polynomial predictor than using
a linear predictor.


We will focus here on the class of one dimensional, $n$-degree, polynomial
regression predictors, namely,

$H^n_{poly} = \{x \mapsto p(x)\}$,

where $p$ is a one dimensional polynomial of degree $n$, parameterized by a vector of
coefficients $(a_0, \ldots, a_n)$. Note that $\mathcal{X} = \mathbb{R}$, since this is a one dimensional polynomial,
and $\mathcal{Y} = \mathbb{R}$, as this is a regression problem.

One way to learn this class is by reduction to the problem of linear regression,
which we have already shown how to solve. To translate a polynomial regression
problem to a linear regression problem, we define the mapping $\psi : \mathbb{R} \to \mathbb{R}^{n+1}$ such
that $\psi(x) = (1, x, x^2, \ldots, x^n)$. Then we have that

$p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n = \langle a, \psi(x)\rangle$

and we can find the optimal vector of coefficients $a$ by using the Least Squares
algorithm as shown earlier.
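
The reduction amounts to a change of features. A minimal Python sketch, with hypothetical coefficients and helper names chosen for this illustration, checks that evaluating the polynomial directly agrees with the inner product $\langle a, \psi(x)\rangle$:

```python
def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def psi(x, n):
    """Feature map psi(x) = (1, x, x^2, ..., x^n)."""
    return [x ** k for k in range(n + 1)]

def poly(a, x):
    """Evaluate a_0 + a_1 x + ... + a_n x^n directly, using Horner's rule."""
    result = 0.0
    for coeff in reversed(a):
        result = result * x + coeff
    return result

a = [1.0, -2.0, 0.0, 0.5]          # p(x) = 1 - 2x + 0.5 x^3
x = 2.0
print(poly(a, x), inner(a, psi(x, 3)))

# A polynomial regression sample becomes a linear regression sample in R^(n+1):
xs = [-1.0, 0.0, 1.0, 2.0]
linear_sample = [(psi(z, 3), poly(a, z)) for z in xs]
```

The transformed pairs in `linear_sample` are ordinary linear regression examples, so any least squares routine for homogenous linear predictors can be applied to them unchanged.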


9.3 LOGISTIC REGRESSION 


In logistic regression we learn a family of functions $h$ from $\mathbb{R}^d$ to the interval $[0, 1]$.
However, logistic regression is used for classification tasks: We can interpret $h(x)$ as
the probability that the label of $x$ is 1. The hypothesis class associated with logistic
regression is the composition of a sigmoid function $\phi_{sig} : \mathbb{R} \to [0, 1]$ over the class of
linear functions $L_d$. In particular, the sigmoid function used in logistic regression is
the logistic function, defined as

$\phi_{sig}(z) = \frac{1}{1 + \exp(-z)}$.    (9.9)

The name “sigmoid” means “S-shaped,” referring to the plot of this function, shown
in the figure:


The hypothesis class is therefore (where for simplicity we are using homogenous
linear functions):

$H_{sig} = \phi_{sig} \circ L_d = \{x \mapsto \phi_{sig}(\langle w, x\rangle) : w \in \mathbb{R}^d\}$.

Note that when $\langle w, x\rangle$ is very large then $\phi_{sig}(\langle w, x\rangle)$ is close to 1, whereas if $\langle w, x\rangle$
is very small (that is, very negative) then $\phi_{sig}(\langle w, x\rangle)$ is close to 0. Recall that the
prediction of the halfspace corresponding to a vector $w$ is $\mathrm{sign}(\langle w, x\rangle)$. Therefore, the
predictions of the halfspace hypothesis and the logistic hypothesis are very similar
whenever $|\langle w, x\rangle|$ is large. However, when $|\langle w, x\rangle|$ is close to 0 we have that
$\phi_{sig}(\langle w, x\rangle) \approx \frac{1}{2}$. Intuitively,
the logistic hypothesis is not sure about the value of the label so it guesses that the
label is $\mathrm{sign}(\langle w, x\rangle)$ with probability slightly larger than 50%. In contrast, the halfspace
hypothesis always outputs a deterministic prediction of either 1 or $-1$, even if
$|\langle w, x\rangle|$ is very close to 0.
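
The contrast between the two hypotheses can be seen by evaluating both on a few scores $\langle w, x\rangle$. The Python sketch below uses an overflow-safe arrangement of the logistic function; that arrangement and the sample scores are choices made for this illustration.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), arranged to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sign(z):
    return 1 if z > 0 else -1

# Each row: the score <w, x>, the halfspace prediction, the logistic prediction.
for score in (-10.0, -0.01, 0.01, 10.0):
    print(score, sign(score), round(sigmoid(score), 4))
```

For scores of large magnitude the logistic output is essentially 0 or 1 and agrees with the halfspace label, while for scores near 0 it stays close to 1/2 even though the halfspace still commits to a label.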

Next, we need to specify a loss function. That is, we should define how bad it is
to predict some $h_w(x) \in [0, 1]$ given that the true label is $y \in \{\pm 1\}$. Clearly, we would
like that $h_w(x)$ would be large if $y = 1$ and that $1 - h_w(x)$ (i.e., the probability of
predicting $-1$) would be large if $y = -1$. Note that

$1 - h_w(x) = 1 - \frac{1}{1 + \exp(-\langle w, x\rangle)} = \frac{\exp(-\langle w, x\rangle)}{1 + \exp(-\langle w, x\rangle)} = \frac{1}{1 + \exp(\langle w, x\rangle)}$.

Therefore, any reasonable loss function would increase monotonically with
$\frac{1}{1 + \exp(y \langle w, x\rangle)}$, or equivalently, would increase monotonically with $1 + \exp(-y \langle w, x\rangle)$.
The logistic loss function used in logistic regression penalizes $h_w$ based on the log of
$1 + \exp(-y \langle w, x\rangle)$ (recall that log is a monotonic function). That is,

$\ell(h_w, (x, y)) = \log\left(1 + \exp(-y \langle w, x\rangle)\right)$.

Therefore, given a training set $S = (x_1, y_1), \ldots, (x_m, y_m)$, the ERM problem associated
with logistic regression is

$\operatorname{argmin}_{w \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 + \exp(-y_i \langle w, x_i\rangle)\right)$.    (9.10)

The advantage of the logistic loss function is that it is a convex function with respect
to $w$; hence the ERM problem can be solved efficiently using standard methods.
We will study how to learn with convex functions, and in particular specify a simple
algorithm for minimizing convex functions, in later chapters.
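
As a rough illustration of Equation (9.10), the Python sketch below evaluates the empirical logistic risk and decreases it with plain gradient descent. Gradient descent, the step size, the iteration count, and the (deliberately nonseparable) toy data are assumptions of this sketch; the text only states that standard convex optimization methods apply.

```python
import math

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def log1p_exp(z):
    """Numerically stable log(1 + exp(z))."""
    if z > 0:
        return z + math.log1p(math.exp(-z))
    return math.log1p(math.exp(z))

def logistic_loss(w, x, y):
    return log1p_exp(-y * inner(w, x))

def empirical_risk(w, sample):
    return sum(logistic_loss(w, x, y) for x, y in sample) / len(sample)

def risk_gradient(w, sample):
    """Gradient of the empirical risk: the average of -y x / (1 + exp(y <w, x>))."""
    grad = [0.0] * len(w)
    for x, y in sample:
        # 1 / (1 + exp(margin)) computed as exp(-log(1 + exp(margin)))
        coeff = -y * math.exp(-log1p_exp(y * inner(w, x)))
        grad = [g + coeff * xi / len(sample) for g, xi in zip(grad, x)]
    return grad

def gradient_descent(sample, step=0.5, iters=200):
    w = [0.0] * len(sample[0][0])
    for _ in range(iters):
        g = risk_gradient(w, sample)
        w = [wi - step * gi for wi, gi in zip(w, g)]
    return w

# (1, 1) is labeled +1 and (0.5, 0.5) is labeled -1, so no homogenous halfspace
# classifies this sample perfectly.
sample = [([1.0, 1.0], 1), ([2.0, 0.5], 1), ([-1.0, -1.0], -1), ([0.5, 0.5], -1)]

risk_at_zero = empirical_risk([0.0, 0.0], sample)   # equals log(2)
w = gradient_descent(sample)
print(risk_at_zero, empirical_risk(w, sample))
```

Because the objective is convex, a small enough step size decreases the risk at every iteration, including on nonseparable data where the Perceptron would not terminate.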
The ERM problem associated with logistic regression (Equation (9.10)) is identical
to the problem of finding a Maximum Likelihood Estimator, a well-known
statistical approach for finding the parameters that maximize the joint probability of
a given data set assuming a specific parametric probability function. We will study
the Maximum Likelihood approach in Chapter 24.


9.4 SUMMARY 


The family of linear predictors is one of the most useful families of hypothesis 
classes, and many learning algorithms that are being widely used in practice rely 
on linear predictors. We have shown efficient algorithms for learning linear predic- 
tors with respect to the zero-one loss in the separable case and with respect to the 
squared and logistic losses in the unrealizable case. In later chapters we will present 
the properties of the loss function that enable efficient learning. 

Naturally, linear predictors are effective whenever we assume, as prior knowl- 
edge, that some linear predictor attains low risk with respect to the underlying 
distribution. In the next chapter we show how to construct nonlinear predictors by 
composing linear predictors on top of simple classes. This will enable us to employ 
linear predictors for a variety of prior knowledge assumptions. 


9.5 BIBLIOGRAPHIC REMARKS 


The Perceptron algorithm dates back to Rosenblatt (1958). The proof of its convergence
rate is due to Agmon (1954) and Novikoff (1962). Least Squares regression goes
back to Gauss (1795), Legendre (1805), and Adrain (1808).


9.6 EXERCISES 


9.1 Show how to cast the ERM problem of linear regression with respect to the absolute
value loss function, $\ell(h, (x, y)) = |h(x) - y|$, as a linear program; namely, show how
to write the problem

$\min_{w} \sum_{i=1}^{m} |\langle w, x_i\rangle - y_i|$

as a linear program.
Hint: Start with proving that for any $c \in \mathbb{R}$,

$|c| = \min_{a \ge 0} a$ s.t. $c \le a$ and $c \ge -a$.


9.2 Show that the matrix $A$ defined in Equation (9.6) is invertible if and only if $x_1, \ldots, x_m$
span $\mathbb{R}^d$.

9.3 Show that Theorem 9.1 is tight in the following sense: For any positive integer $m$,
there exist a vector $w^* \in \mathbb{R}^d$ (for some appropriate $d$) and a sequence of examples
$\{(x_1, y_1), \ldots, (x_m, y_m)\}$ such that the following hold:

- $R = \max_i \|x_i\| \le 1$.

- $\|w^*\|^2 = m$, and for all $i \le m$, $y_i \langle x_i, w^* \rangle \ge 1$. Note that, using the notation in
  Theorem 9.1, we therefore get

  $$B = \min\{\|w\| : \forall i \in [m],\ y_i \langle w, x_i \rangle \ge 1\} \le \sqrt{m}.$$

  Thus, $(BR)^2 \le m$.

- When running the Perceptron on this sequence of examples it makes $m$ updates
  before converging.

Hint: Choose $d = m$ and for every $i$ choose $x_i = e_i$.

9.4 (*) Given any number $m$, find an example of a sequence of labeled examples
$((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathbb{R}^3 \times \{-1, +1\})^m$ on which the upper bound of Theorem 9.1
equals $m$ and the Perceptron algorithm is bound to make $m$ mistakes.
Hint: Set each $x_i$ to be a three dimensional vector of the form $(a, b, y_i)$, where
$a^2 + b^2 = R^2 - 1$. Let $w^*$ be the vector $(0, 0, 1)$. Now, go over the proof of the
Perceptron's upper bound (Theorem 9.1), see where we used inequalities ($\le$) rather
than equalities ($=$), and figure out scenarios where the inequality actually holds with
equality.

9.5 Suppose we modify the Perceptron algorithm as follows: In the update step, instead
of performing $w^{(t+1)} = w^{(t)} + y_i x_i$ whenever we make a mistake, we perform $w^{(t+1)} =
w^{(t)} + \eta y_i x_i$ for some $\eta > 0$. Prove that the modified Perceptron will perform the
same number of iterations as the vanilla Perceptron and will converge to a vector
that points in the same direction as the output of the vanilla Perceptron.

9.6 In this problem, we will get bounds on the VC-dimension of the class of (closed)
balls in $\mathbb{R}^d$, that is,

$$\mathcal{B}_d = \{B_{v,r} : v \in \mathbb{R}^d,\ r > 0\},$$

where

$$B_{v,r}(x) = \begin{cases} 1 & \text{if } \|x - v\| \le r \\ 0 & \text{otherwise.} \end{cases}$$

1. Consider the mapping $\phi : \mathbb{R}^d \to \mathbb{R}^{d+1}$ defined by $\phi(x) = (x, \|x\|^2)$. Show that if
   $x_1, \ldots, x_m$ are shattered by $\mathcal{B}_d$ then $\phi(x_1), \ldots, \phi(x_m)$ are shattered by the class of
   halfspaces in $\mathbb{R}^{d+1}$ (in this question we assume that $\mathrm{sign}(0) = 1$). What does this
   tell us about $\mathrm{VCdim}(\mathcal{B}_d)$?
2. (*) Find a set of $d+1$ points in $\mathbb{R}^d$ that is shattered by $\mathcal{B}_d$. Conclude that

   $$d + 1 \le \mathrm{VCdim}(\mathcal{B}_d) \le d + 2.$$
Boosting is an algorithmic paradigm that grew out of a theoretical question and
became a very practical machine learning tool. The boosting approach uses a generalization
eralization of linear predictors to address two major issues that have been raised 
earlier in the book. The first is the bias-complexity tradeoff. We have seen (in 
Chapter 5) that the error of an ERM learner can be decomposed into a sum of 
approximation error and estimation error. The more expressive the hypothesis class 
the learner is searching over, the smaller the approximation error is, but the larger 
the estimation error becomes. A learner is thus faced with the problem of picking a 
good tradeoff between these two considerations. The boosting paradigm allows the 
learner to have smooth control over this tradeoff. The learning starts with a basic 
class (that might have a large approximation error), and as it progresses the class 
that the predictor may belong to grows richer. 

The second issue that boosting addresses is the computational complexity of 
learning. As seen in Chapter 8, for many interesting concept classes the task of 
finding an ERM hypothesis may be computationally infeasible. A boosting algo- 
rithm amplifies the accuracy of weak learners. Intuitively, one can think of a weak 
learner as an algorithm that uses a simple “rule of thumb” to output a hypothesis 
that comes from an easy-to-learn hypothesis class and performs just slightly better 
than a random guess. When a weak learner can be implemented efficiently, boosting
provides a tool for aggregating such weak hypotheses to gradually approximate
good predictors for larger, and harder to learn, classes.

In this chapter we will describe and analyze a practically useful boosting algo- 
rithm, AdaBoost (a shorthand for Adaptive Boosting). The AdaBoost algorithm 
outputs a hypothesis that is a linear combination of simple hypotheses. In other 
words, AdaBoost relies on the family of hypothesis classes obtained by composing 
a linear predictor on top of simple classes. We will show that AdaBoost enables us 
to control the tradeoff between the approximation and estimation errors by varying 
a single parameter. 

AdaBoost demonstrates a general theme, that will recur later in the book, of 
expanding the expressiveness of linear predictors by composing them on top of 
other functions. This will be elaborated in Section 10.3. 
AdaBoost stemmed from the theoretical question of whether an efficient weak
learner can be “boosted” into an efficient strong learner. This question was raised by 
Kearns and Valiant in 1988 and solved in 1990 by Robert Schapire, then a graduate 
student at MIT. However, the proposed mechanism was not very practical. In 1995, 
Robert Schapire and Yoav Freund proposed the AdaBoost algorithm, which was 
the first truly practical implementation of boosting. This simple and elegant algo- 
rithm became hugely popular, and Freund and Schapire’s work has been recognized 
by numerous awards. 

Furthermore, boosting is a great example of the practical impact of learning theory.
While boosting originated as a purely theoretical problem, it has led to popular
and widely used algorithms. Indeed, as we shall demonstrate later in this chapter, 
AdaBoost has been successfully used for learning to detect faces in images. 


10.1 WEAK LEARNABILITY 


Recall the definition of PAC learning given in Chapter 3: A hypothesis class, $\mathcal{H}$,
is PAC learnable if there exist $m_{\mathcal{H}} : (0,1)^2 \to \mathbb{N}$ and a learning algorithm with the
following property: For every $\epsilon, \delta \in (0,1)$, for every distribution $\mathcal{D}$ over $\mathcal{X}$, and for
every labeling function $f : \mathcal{X} \to \{\pm 1\}$, if the realizable assumption holds with respect
to $\mathcal{H}, \mathcal{D}, f$, then when running the learning algorithm on $m \ge m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples
generated by $\mathcal{D}$ and labeled by $f$, the algorithm returns a hypothesis $h$ such that,
with probability of at least $1 - \delta$, $L_{(\mathcal{D}, f)}(h) \le \epsilon$.

Furthermore, the fundamental theorem of learning theory (Theorem 6.8 in 
Chapter 6) characterizes the family of learnable classes and states that every PAC 
learnable class can be learned using any ERM algorithm. However, the definition of
PAC learning and the fundamental theorem of learning theory ignore the computational
aspect of learning. Indeed, as we have shown in Chapter 8, there are cases in
which implementing the ERM rule is computationally hard (even in the realizable 
case). 

However, perhaps we can trade computational hardness against the accuracy
requirement. Given a distribution D and a target labeling function f, maybe there
exists an efficiently computable learning algorithm whose error is just slightly better 
than a random guess? This motivates the following definition. 


Definition 10.1 ($\gamma$-Weak-Learnability).

- A learning algorithm, $A$, is a $\gamma$-weak-learner for a class $\mathcal{H}$ if there exists a function
  $m_{\mathcal{H}} : (0,1) \to \mathbb{N}$ such that for every $\delta \in (0,1)$, for every distribution $\mathcal{D}$ over
  $\mathcal{X}$, and for every labeling function $f : \mathcal{X} \to \{\pm 1\}$, if the realizable assumption
  holds with respect to $\mathcal{H}, \mathcal{D}, f$, then when running the learning algorithm on $m \ge
  m_{\mathcal{H}}(\delta)$ i.i.d. examples generated by $\mathcal{D}$ and labeled by $f$, the algorithm returns a
  hypothesis $h$ such that, with probability of at least $1 - \delta$, $L_{(\mathcal{D}, f)}(h) \le 1/2 - \gamma$.

- A hypothesis class $\mathcal{H}$ is $\gamma$-weak-learnable if there exists a $\gamma$-weak-learner for
  that class.


This definition is almost identical to the definition of PAC learning, which here 
we will call strong learning, with one crucial difference: Strong learnability implies 
the ability to find an arbitrarily good classifier (with error rate at most $\epsilon$ for an
arbitrarily small $\epsilon > 0$). In weak learnability, however, we only need to output a
hypothesis whose error rate is at most 1/2 — y, namely, whose error rate is slightly 
better than what a random labeling would give us. The hope is that it may be easier 
to come up with efficient weak learners than with efficient (full) PAC learners. 

The fundamental theorem of learning (Theorem 6.8) states that if a hypothesis
class $\mathcal{H}$ has a VC dimension $d$, then the sample complexity of PAC learning $\mathcal{H}$ satisfies
$m_{\mathcal{H}}(\epsilon, \delta) \ge C_1 \frac{d + \log(1/\delta)}{\epsilon}$, where $C_1$ is a constant. Applying this with $\epsilon = 1/2 - \gamma$ we
immediately obtain that if $d = \infty$ then $\mathcal{H}$ is not $\gamma$-weak-learnable. This implies that
from the statistical perspective (i.e., if we ignore computational complexity), weak
learnability is also characterized by the VC dimension of $\mathcal{H}$ and therefore is just as
hard as PAC (strong) learning. However, when we do consider computational com- 
plexity, the potential advantage of weak learning is that maybe there is an algorithm 
that satisfies the requirements of weak learning and can be implemented efficiently. 

One possible approach is to take a “simple” hypothesis class, denoted B, and to 
apply ERM with respect to B as the weak learning algorithm. For this to work, we
need B to satisfy two requirements:


- $\mathrm{ERM}_B$ is efficiently implementable.
- For every sample that is labeled by some hypothesis from $\mathcal{H}$, any $\mathrm{ERM}_B$
  hypothesis will have an error of at most $1/2 - \gamma$.


Then, the immediate question is whether we can boost an efficient weak learner 
into an efficient strong learner. In the next section we will show that this is indeed 
possible, but before that, let us show an example in which efficient weak learnability 
of a class H is possible using a base hypothesis class B. 


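Example 10.1 (Weak Learning of 3-Piece Classifiers Using Decision Stumps). Let
$\mathcal{X} = \mathbb{R}$ and let $\mathcal{H}$ be the class of 3-piece classifiers, namely, $\mathcal{H} = \{h_{\theta_1, \theta_2, b} : \theta_1, \theta_2 \in
\mathbb{R},\ \theta_1 < \theta_2,\ b \in \{\pm 1\}\}$, where for every $x$,

$$h_{\theta_1, \theta_2, b}(x) = \begin{cases} +b & \text{if } x < \theta_1 \text{ or } x > \theta_2 \\ -b & \text{if } \theta_1 \le x \le \theta_2. \end{cases}$$

An example hypothesis (for $b = 1$) is illustrated as follows:

    +   |   −   |   +
       θ_1     θ_2

Let $B$ be the class of Decision Stumps, that is, $B = \{x \mapsto \mathrm{sign}(x - \theta) \cdot b : \theta \in \mathbb{R},\ b \in
\{\pm 1\}\}$. In the following we show that $\mathrm{ERM}_B$ is a $\gamma$-weak learner for $\mathcal{H}$, for $\gamma = 1/12$.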

To see that, we first show that for every distribution that is consistent with $\mathcal{H}$,
there exists a decision stump with $L_{\mathcal{D}}(h) \le 1/3$. Indeed, just note that every classifier
in $\mathcal{H}$ consists of three regions (two unbounded rays and a center interval) with alternate
labels. For any pair of such regions, there exists a decision stump that agrees
with the labeling of these two components. Note that for every distribution $\mathcal{D}$ over
$\mathbb{R}$ and every partitioning of the line into three such regions, one of these regions
must have $\mathcal{D}$-weight of at most 1/3. Let $h \in \mathcal{H}$ be a zero error hypothesis. A decision
stump that disagrees with $h$ only on such a region has an error of at most 1/3.
Finally, since the VC-dimension of decision stumps is 2, if the sample size is
greater than $\Omega(\log(1/\delta)/\epsilon^2)$, then with probability of at least $1 - \delta$, the $\mathrm{ERM}_B$ rule
returns a hypothesis with an error of at most $1/3 + \epsilon$. Setting $\epsilon = 1/12$ we obtain that
the error of $\mathrm{ERM}_B$ is at most $1/3 + 1/12 = 1/2 - 1/12$.

We see that $\mathrm{ERM}_B$ is a $\gamma$-weak learner for $\mathcal{H}$. We next show how to implement
the ERM rule efficiently for decision stumps.


10.1.1 Efficient Implementation of ERM for Decision Stumps 


Let $\mathcal{X} = \mathbb{R}^d$ and consider the base hypothesis class of decision stumps over $\mathbb{R}^d$,
namely,

$$\mathcal{H}_{DS} = \{x \mapsto \mathrm{sign}(\theta - x_i) \cdot b : \theta \in \mathbb{R},\ i \in [d],\ b \in \{\pm 1\}\}.$$

For simplicity, assume that $b = 1$; that is, we focus on all the hypotheses in $\mathcal{H}_{DS}$ of the
form $\mathrm{sign}(\theta - x_i)$. Let $S = ((x_1, y_1), \ldots, (x_m, y_m))$ be a training set. We will show how
to implement an ERM rule, namely, how to find a decision stump that minimizes
$L_S(h)$. Furthermore, since in the next section we will show that AdaBoost requires
finding a hypothesis with a small risk relative to some distribution over $S$, we will
show here how to minimize such risk functions. Concretely, let $D$ be a probability
vector in $\mathbb{R}^m$ (that is, all elements of $D$ are nonnegative and $\sum_i D_i = 1$). The weak
learner we describe later receives $D$ and $S$ and outputs a decision stump $h : \mathcal{X} \to \mathcal{Y}$
that minimizes the risk w.r.t. $D$,

$$L_D(h) = \sum_{i=1}^{m} D_i \mathbb{1}_{[h(x_i) \ne y_i]}.$$

Note that if $D = (1/m, \ldots, 1/m)$ then $L_D(h) = L_S(h)$.
Recall that each decision stump is parameterized by an index $j \in [d]$ and a
threshold $\theta$. Therefore, minimizing $L_D(h)$ amounts to solving the problem

$$\min_{j \in [d]} \min_{\theta \in \mathbb{R}} \left( \sum_{i : y_i = 1} D_i \mathbb{1}_{[x_{i,j} > \theta]} + \sum_{i : y_i = -1} D_i \mathbb{1}_{[x_{i,j} \le \theta]} \right). \qquad (10.1)$$

Fix $j \in [d]$ and let us sort the examples so that $x_{1,j} \le x_{2,j} \le \ldots \le x_{m,j}$. Define
$\Theta_j = \{\frac{x_{i,j} + x_{i+1,j}}{2} : i \in [m-1]\} \cup \{(x_{1,j} - 1), (x_{m,j} + 1)\}$. Note that for any $\theta \in \mathbb{R}$ there
exists $\theta' \in \Theta_j$ that yields the same predictions for the sample $S$ as the threshold $\theta$.
Therefore, instead of minimizing over $\theta \in \mathbb{R}$ we can minimize over $\theta \in \Theta_j$.

This already gives us an efficient procedure: Choose $j \in [d]$ and $\theta \in \Theta_j$ that
minimize the objective value of Equation (10.1). For every $j$ and $\theta \in \Theta_j$ we have to
calculate a sum over $m$ examples; therefore the runtime of this approach would be
$O(dm^2)$. We next show a simple trick that enables us to minimize the objective in
time $O(dm)$.

The observation is as follows. Suppose we have calculated the objective for $\theta \in
(x_{i-1,j}, x_{i,j})$. Let $F(\theta)$ be the value of the objective. Then, when we consider $\theta' \in
(x_{i,j}, x_{i+1,j})$ we have that

$$F(\theta') = F(\theta) - D_i \mathbb{1}_{[y_i = 1]} + D_i \mathbb{1}_{[y_i = -1]} = F(\theta) - y_i D_i.$$

Therefore, we can calculate the objective at $\theta'$ in a constant time, given the objective
at the previous threshold, $\theta$. It follows that after a preprocessing step in which we
sort the examples with respect to each coordinate, the minimization problem can be
performed in time $O(dm)$. This yields the following pseudocode.


ERM for Decision Stumps 


input:
    training set S = (x_1, y_1), ..., (x_m, y_m)
    distribution vector D
goal: Find j*, θ* that solve Equation (10.1)
initialize: F* = ∞
for j = 1, ..., d
    sort S using the j'th coordinate, and denote
        x_{1,j} ≤ x_{2,j} ≤ ... ≤ x_{m,j} ≤ x_{m+1,j} := x_{m,j} + 1
    F = Σ_{i : y_i = 1} D_i
    if F < F*
        F* = F, θ* = x_{1,j} − 1, j* = j
    for i = 1, ..., m
        F = F − y_i D_i
        if F < F* and x_{i,j} ≠ x_{i+1,j}
            F* = F, θ* = (x_{i,j} + x_{i+1,j}) / 2, j* = j
output j*, θ*
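
The procedure above might be sketched in Python as follows. This is a minimal sketch rather than a reference implementation: the function name `erm_decision_stump`, the list-of-lists input format, the 0-based coordinate index, and the extra returned objective value are all assumptions made for illustration. As in the text, only stumps of the form sign(θ − x_j) (that is, b = 1) are considered.

```python
import math

def erm_decision_stump(X, y, D):
    """Return (j_star, theta_star, F_star) minimizing the D-weighted error of
    h(x) = sign(theta - x[j]) on the sample, i.e. the objective of Equation (10.1).
    X is a list of m points (each a list of d floats), y holds labels in {+1, -1},
    and D is a probability vector of length m. Coordinates are 0-based here."""
    m, d = len(X), len(X[0])
    F_star, theta_star, j_star = math.inf, None, None
    for j in range(d):
        # Sort example indices by their j-th coordinate.
        order = sorted(range(m), key=lambda i: X[i][j])
        # Sorted coordinate values, plus the sentinel x_{m+1,j} = x_{m,j} + 1.
        xs = [X[i][j] for i in order] + [X[order[-1]][j] + 1.0]
        # Threshold below every point: all predictions are -1, so the
        # objective is the total weight of the positive examples.
        F = sum(D[i] for i in range(m) if y[i] == 1)
        if F < F_star:
            F_star, theta_star, j_star = F, xs[0] - 1.0, j
        for pos, i in enumerate(order):
            # Moving the threshold past example i flips its prediction to +1.
            F -= y[i] * D[i]
            # Only thresholds strictly between two distinct values are valid.
            if F < F_star and xs[pos] != xs[pos + 1]:
                F_star = F
                theta_star = 0.5 * (xs[pos] + xs[pos + 1])
                j_star = j
    return j_star, theta_star, F_star
```

The sort dominates the cost of each coordinate; the inner loop itself does constant work per example, which is the point of the incremental update F ← F − y_i D_i.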


10.2 ADABOOST


AdaBoost (short for Adaptive Boosting) is an algorithm that has access to a weak 
learner and finds a hypothesis with a low empirical risk. The AdaBoost algorithm 
receives as input a training set of examples $S = (x_1, y_1), \ldots, (x_m, y_m)$, where for
each $i$, $y_i = f(x_i)$ for some labeling function $f$. The boosting process proceeds in
a sequence of consecutive rounds. At round $t$, the booster first defines a distribution
over the examples in $S$, denoted $D^{(t)}$. That is, $D^{(t)} \in \mathbb{R}^m_+$ and $\sum_{i=1}^{m} D^{(t)}_i = 1$. Then,
the booster passes the distribution $D^{(t)}$ and the sample $S$ to the weak learner. (That
way, the weak learner can construct i.i.d. examples according to $D^{(t)}$ and $f$.) The
weak learner is assumed to return a “weak” hypothesis, $h_t$, whose error,

$$\epsilon_t := L_{D^{(t)}}(h_t) := \sum_{i=1}^{m} D^{(t)}_i \mathbb{1}_{[h_t(x_i) \ne y_i]},$$

is at most $1/2 - \gamma$ (of course, there is a probability of at most $\delta$ that the weak learner
fails). Then, AdaBoost assigns a weight for $h_t$ as follows: $w_t = \frac{1}{2} \log\left(\frac{1}{\epsilon_t} - 1\right)$. That
is, the smaller the error of $h_t$, the larger its weight. At the end of the
round, AdaBoost updates the distribution so that examples on which $h_t$ errs will
get a higher probability mass while examples on which $h_t$ is correct will get a lower
probability mass. Intuitively, this will force the weak learner to focus on the problematic
examples in the next round. The output of the AdaBoost algorithm is a
“strong” classifier that is based on a weighted sum of all the weak hypotheses. The
pseudocode of AdaBoost is presented in the following.


AdaBoost 


input:
    training set S = (x_1, y_1), ..., (x_m, y_m)
    weak learner WL
    number of rounds T
initialize D^(1) = (1/m, ..., 1/m).
for t = 1, ..., T:
    invoke weak learner h_t = WL(D^(t), S)
    compute ε_t = Σ_{i=1}^{m} D_i^(t) 1[y_i ≠ h_t(x_i)]
    let w_t = (1/2) log(1/ε_t − 1)
    update D_i^(t+1) = D_i^(t) exp(−w_t y_i h_t(x_i)) / Σ_{j=1}^{m} D_j^(t) exp(−w_t y_j h_t(x_j))
        for all i = 1, ..., m
output the hypothesis h_s(x) = sign(Σ_{t=1}^{T} w_t h_t(x)).
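
A minimal Python sketch of this loop is given below. It is an illustration under stated assumptions, not the authors' implementation: the names `adaboost` and `stump_weak_learner`, the representation of S as a list of `(x, y)` pairs with scalar `x`, the brute-force one-dimensional stump learner, the clipping of ε_t away from 0 and 1, and the choice sign(0) = +1 in the output are all introduced here for the example.

```python
import math

def stump_weak_learner(D, S):
    """Hypothetical weak learner: brute-force ERM over one-dimensional stumps
    x -> b if x < theta else -b, minimizing the D-weighted error on S."""
    xs = sorted(x for x, _ in S)
    thetas = [xs[0] - 1.0] + [(a + c) / 2 for a, c in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    best_err, best_h = None, None
    for theta in thetas:
        for b in (1, -1):
            h = lambda x, theta=theta, b=b: b if x < theta else -b
            err = sum(Di for Di, (x, y) in zip(D, S) if h(x) != y)
            if best_err is None or err < best_err:
                best_err, best_h = err, h
    return best_h

def adaboost(S, weak_learner, T):
    """Run T rounds of AdaBoost on S and return the output hypothesis h_s."""
    m = len(S)
    D = [1.0 / m] * m                      # D^(1) is uniform
    ensemble = []                          # pairs (w_t, h_t)
    for _ in range(T):
        h = weak_learner(D, S)
        eps = sum(Di for Di, (x, y) in zip(D, S) if h(x) != y)
        # Assumption of this sketch: clip eps so that the logarithm stays finite.
        eps = min(max(eps, 1e-12), 1.0 - 1e-12)
        w = 0.5 * math.log(1.0 / eps - 1.0)
        ensemble.append((w, h))
        # Reweight: mistakes gain mass, correct predictions lose mass.
        D = [Di * math.exp(-w * y * h(x)) for Di, (x, y) in zip(D, S)]
        Z = sum(D)
        D = [Di / Z for Di in D]

    def h_s(x):
        return 1 if sum(w * h(x) for w, h in ensemble) >= 0 else -1
    return h_s
```

For instance, on the sample `S = [(float(x), 1 if (x < 3 or x > 6) else -1) for x in range(10)]`, which is labeled by a 3-piece classifier, no single stump is consistent with the labels, yet `adaboost(S, stump_weak_learner, 50)` returns a hypothesis that classifies every training point correctly.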


The following theorem shows that the training error of the output hypothesis 
decreases exponentially fast with the number of boosting rounds. 


Theorem 10.2. Let $S$ be a training set and assume that at each iteration of AdaBoost,
the weak learner returns a hypothesis for which $\epsilon_t \le 1/2 - \gamma$. Then, the training error
of the output hypothesis of AdaBoost is at most

$$L_S(h_s) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}_{[h_s(x_i) \ne y_i]} \le \exp(-2\gamma^2 T).$$
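
To get a feel for the rate in Theorem 10.2, the short Python sketch below evaluates the bound numerically. It relies on one observation that is not stated in the theorem itself: the training error is a multiple of 1/m, so once the bound drops below 1/m the training error must be zero. The function names are hypothetical.

```python
import math

def training_error_bound(gamma, T):
    """Upper bound exp(-2 * gamma^2 * T) on the training error after T rounds."""
    return math.exp(-2.0 * gamma ** 2 * T)

def rounds_for_zero_training_error(gamma, m):
    """Smallest integer T with exp(-2 * gamma^2 * T) < 1/m, i.e. T > log(m) / (2 gamma^2)."""
    return math.floor(math.log(m) / (2.0 * gamma ** 2)) + 1
```

With an edge of γ = 1/6 and m = 10 examples, `rounds_for_zero_training_error(1/6, 10)` evaluates to 42; the number of rounds grows only logarithmically with m but quadratically with 1/γ.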


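Proof. For each $t$, denote $f_t = \sum_{p \le t} w_p h_p$. Therefore, the output of AdaBoost is
$f_T$. In addition, denote

$$Z_t = \frac{1}{m} \sum_{i=1}^{m} e^{-y_i f_t(x_i)}.$$

Note that for any hypothesis we have that $\mathbb{1}_{[h(x) \ne y]} \le e^{-y h(x)}$. Therefore, $L_S(f_T) \le
Z_T$, so it suffices to show that $Z_T \le e^{-2\gamma^2 T}$. To upper bound $Z_T$ we rewrite it as

$$Z_T = \frac{Z_T}{Z_0} = \frac{Z_T}{Z_{T-1}} \cdot \frac{Z_{T-1}}{Z_{T-2}} \cdots \frac{Z_2}{Z_1} \cdot \frac{Z_1}{Z_0}, \qquad (10.2)$$

where we used the fact that $Z_0 = 1$ because $f_0 \equiv 0$. Therefore, it suffices to show that
for every round $t$,

$$\frac{Z_{t+1}}{Z_t} \le e^{-2\gamma^2}. \qquad (10.3)$$

To do so, we first note that using a simple inductive argument, for all $t$ and $i$,

$$D_i^{(t+1)} = \frac{e^{-y_i f_t(x_i)}}{\sum_{j=1}^{m} e^{-y_j f_t(x_j)}}.$$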
learner that finds the minimum value of $L_D(h)$ for decision stumps, as described in
the previous section. 


Theorem 10.2 tells us that the empirical risk of the hypothesis constructed by 
AdaBoost goes to zero as T grows. However, what we really care about is the true 
risk of the output hypothesis. To argue about the true risk, we note that the output 
of AdaBoost is in fact a composition of a halfspace over the predictions of the T 
weak hypotheses constructed by the weak learner. In the next section we show that 
if the weak hypotheses come from a base hypothesis class of low VC-dimension, 
then the estimation error of AdaBoost will be small; namely, the true risk of the 
output of AdaBoost would not be very far from its empirical risk. 


10.3 LINEAR COMBINATIONS OF BASE HYPOTHESES 


As mentioned previously, a popular approach for constructing a weak learner is to 
apply the ERM rule with respect to a base hypothesis class (e.g., ERM over decision 
stumps). We have also seen that boosting outputs a composition of a halfspace over 
the predictions of the weak hypotheses. Therefore, given a base hypothesis class B 
(e.g., decision stumps), the output of AdaBoost will be a member of the following 
class: 


$$L(B, T) = \left\{ x \mapsto \mathrm{sign}\left( \sum_{t=1}^{T} w_t h_t(x) \right) : w \in \mathbb{R}^T,\ \forall t,\ h_t \in B \right\}. \qquad (10.4)$$

That is, each $h \in L(B, T)$ is parameterized by $T$ base hypotheses from $B$ and by
a vector $w \in \mathbb{R}^T$. The prediction of such an $h$ on an instance $x$ is obtained by first
applying the $T$ base hypotheses to construct the vector $\psi(x) = (h_1(x), \ldots, h_T(x)) \in
\mathbb{R}^T$, and then applying the (homogenous) halfspace defined by $w$ on $\psi(x)$.
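
The two-stage prediction just described can be sketched in a few lines of Python. The names `stump`, `psi`, and `predict`, the convention sign(0) = +1, and the particular thresholds and weights are assumptions chosen for illustration.

```python
def stump(theta):
    """Hypothetical base hypothesis x -> sign(x - theta), taking sign(0) = +1."""
    return lambda x: 1 if x >= theta else -1

def psi(x, base_hypotheses):
    """Map an instance to the vector (h_1(x), ..., h_T(x))."""
    return [h(x) for h in base_hypotheses]

def predict(x, w, base_hypotheses):
    """Apply the homogeneous halfspace defined by w to psi(x)."""
    score = sum(w_t * v for w_t, v in zip(w, psi(x, base_hypotheses)))
    return 1 if score >= 0 else -1

# Three stumps combined with weights (1, -1, -1) label the interval [1, 3)
# positive and its surroundings negative (for inputs above -100), which no
# single stump can do.
base = [stump(1.0), stump(3.0), stump(-100.0)]
w = [1.0, -1.0, -1.0]
```

Here `predict(0.0, w, base)`, `predict(2.0, w, base)`, and `predict(5.0, w, base)` evaluate to −1, +1, and −1 respectively.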

In this section we analyze the estimation error of L(B,T) by bounding the VC- 
dimension of L(B, T) in terms of the VC-dimension of B and T. We will show that, 
up to logarithmic factors, the VC-dimension of $L(B, T)$ is bounded by $T$ times the
VC-dimension of B. It follows that the estimation error of AdaBoost grows linearly 
with T. On the other hand, the empirical risk of AdaBoost decreases with T. In 
fact, as we demonstrate later, T can be used to decrease the approximation error 
of L(B,T). Therefore, the parameter T of AdaBoost enables us to control the bias- 
complexity tradeoff. 

To demonstrate how the expressive power of $L(B, T)$ increases with $T$, consider
the simple example, in which $\mathcal{X} = \mathbb{R}$ and the base class is Decision Stumps,

$$\mathcal{H}_{DS1} = \{x \mapsto \mathrm{sign}(x - \theta) \cdot b : \theta \in \mathbb{R},\ b \in \{\pm 1\}\}.$$

Note that in this one dimensional case, $\mathcal{H}_{DS1}$ is in fact equivalent to (nonhomogenous)
halfspaces on $\mathbb{R}$.

Now, let $\mathcal{H}$ be the rather complex class (compared to halfspaces on the line) of
piece-wise constant functions. Let $g_r$ be a piece-wise constant function with at most
$r$ pieces; that is, there exist thresholds $-\infty = \theta_0 < \theta_1 < \theta_2 < \cdots < \theta_r = \infty$ such that

$$g_r(x) = \sum_{i=1}^{r} \alpha_i \mathbb{1}_{[x \in (\theta_{i-1}, \theta_i]]} \qquad \forall i,\ \alpha_i \in \{\pm 1\}.$$
i=l 
[Figure 10.1 image: four panels labeled A, B, C, and D]
Figure 10.1. The four types of functions, g, used by the base hypotheses for face recogni- 
tion. The value of g for type A or B is the difference between the sum of the pixels within 
two rectangular regions. These regions have the same size and shape and are horizontally 
or vertically adjacent. For type C, the value of g is the sum within two outside rectangles 
subtracted from the sum in a center rectangle. For type D, we compute the difference 
between diagonal pairs of rectangles. 


Figure 10.2. The first and second features selected by AdaBoost, as implemented by Viola 
and Jones. The two features are shown in the top row and then overlaid on a typical train- 
ing face in the bottom row. The first feature measures the difference in intensity between 
the region of the eyes and a region across the upper cheeks. The feature capitalizes on 
the observation that the eye region is often darker than the cheeks. The second feature 
compares the intensities in the eye regions to the intensity across the bridge of the nose. 


10.5 SUMMARY 


Boosting is a method for amplifying the accuracy of weak learners. In this chapter 
we described the AdaBoost algorithm. We have shown that after T iterations of 
AdaBoost, it returns a hypothesis from the class L(B,T), obtained by composing a 
linear classifier on T hypotheses from a base class B. We have demonstrated how the 
parameter T controls the tradeoff between approximation and estimation errors. In 
the next chapter we will study how to tune parameters such as T, on the basis of the 
data. 


10.6 BIBLIOGRAPHIC REMARKS 


As mentioned before, boosting stemmed from the theoretical question of whether
an efficient weak learner can be “boosted” into an efficient strong learner, a question
raised by Kearns and Valiant (1988) and solved by Schapire (1990). The AdaBoost
algorithm was proposed by Freund and Schapire (1995).
Boosting can be viewed from many perspectives. In the purely theoretical context,
AdaBoost can be interpreted as a negative result: If strong learning of a
hypothesis class is computationally hard, so is weak learning of this class. This negative
result can be useful for showing hardness of agnostic PAC learning of a class
$B$ based on hardness of PAC learning of some other class $\mathcal{H}$, as long as $\mathcal{H}$ is weakly
learnable using B. For example, Klivans and Sherstov (2006) have shown that PAC 
learning of the class of intersection of halfspaces is hard (even in the realizable case). 
This hardness result can be used to show that agnostic PAC learning of a single half- 
space is also computationally hard (Shalev-Shwartz, Shamir & Sridharan 2010). The 
idea is to show that an agnostic PAC learner for a single halfspace can yield a weak 
learner for the class of intersection of halfspaces, and since such a weak learner can 
be boosted, we will obtain a strong learner for the class of intersection of halfspaces. 

AdaBoost also shows an equivalence between the existence of a weak learner 
and separability of the data using a linear classifier over the predictions of base 
hypotheses. This result is closely related to von Neumann’s minimax theorem (von 
Neumann 1928), a fundamental result in game theory. 

AdaBoost is also related to the concept of margin, which we will study later 
on in Chapter 15. It can also be viewed as a forward greedy selection algorithm, a 
topic that will be presented in Chapter 25. A recent book by Schapire and Freund 
(2012) covers boosting from all points of view and gives easy access to the wealth of 
research that this field has produced. 


10.7 EXERCISES 


10.1 Boosting the Confidence: Let $A$ be an algorithm that guarantees the following:
There exist some constant $\delta_0 \in (0,1)$ and a function $m_{\mathcal{H}} : (0,1) \to \mathbb{N}$ such that
for every $\epsilon \in (0,1)$, if $m \ge m_{\mathcal{H}}(\epsilon)$ then for every distribution $\mathcal{D}$ it holds that with
probability of at least $1 - \delta_0$, $L_{\mathcal{D}}(A(S)) \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon$.

Suggest a procedure that relies on $A$ and learns $\mathcal{H}$ in the usual agnostic PAC
learning model and has a sample complexity of

$$m_{\mathcal{H}}(\epsilon, \delta) \le k\, m_{\mathcal{H}}(\epsilon) + \left\lceil \frac{2 \log(4k/\delta)}{\epsilon^2} \right\rceil,$$

where

$$k = \lceil \log(\delta) / \log(\delta_0) \rceil.$$

Hint: Divide the data into $k + 1$ chunks, where each of the first $k$ chunks is of size
$m_{\mathcal{H}}(\epsilon)$ examples. Train on each of the first $k$ chunks using $A$. Argue that the probability that
for all of these chunks we have $L_{\mathcal{D}}(A(S)) > \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon$ is at most $\delta_0^k \le \delta/2$.
Finally, use the last chunk to choose from the $k$ hypotheses that $A$ generated from
the $k$ chunks (by relying on Corollary 4.6).

10.2 Prove that the function $h$ given in Equation (10.5) equals the piece-wise constant
function defined according to the same thresholds as $h$.

10.3 We have informally argued that the AdaBoost algorithm uses the weighting mech- 
anism to “force” the weak learner to focus on the problematic examples in the next 
iteration. In this question we will find some rigorous justification for this argument. 
Show that the error of $h_t$ w.r.t. the distribution $D^{(t+1)}$ is exactly 1/2. That is, show
that for every $t \in [T]$

$$\sum_{i=1}^{m} D_i^{(t+1)} \mathbb{1}_{[y_i \ne h_t(x_i)]} = 1/2.$$

10.4 In this exercise we discuss the VC-dimension of classes of the form $L(B, T)$. We
proved an upper bound of $O(dT \log(dT))$, where $d = \mathrm{VCdim}(B)$. Here we wish to
prove an almost matching lower bound. However, that will not be the case for all
classes $B$.

1. Note that for every class $B$ and every number $T \ge 1$, $\mathrm{VCdim}(B) \le
   \mathrm{VCdim}(L(B, T))$. Find a class $B$ for which $\mathrm{VCdim}(B) = \mathrm{VCdim}(L(B, T))$ for
   every $T \ge 1$.
   Hint: Take $\mathcal{X}$ to be a finite set.

2. Let $B_d$ be the class of decision stumps over $\mathbb{R}^d$. Prove that $\log(d) \le
   \mathrm{VCdim}(B_d) \le 5 + 2\log(d)$.
   Hints:
   - For the upper bound, rely on Exercise 10.11.
   - For the lower bound, assume $d = 2^k$. Let $A$ be a $k \times d$ matrix whose columns
     are all the $d$ binary vectors in $\{\pm 1\}^k$. The rows of $A$ form a set of $k$ vectors
     in $\mathbb{R}^d$. Show that this set is shattered by decision stumps over $\mathbb{R}^d$.

3. Let $T \ge 1$ be any integer. Prove that $\mathrm{VCdim}(L(B_d, T)) \ge 0.5\, T \log(d)$.
   Hint: Construct a set of $\frac{T}{2} k$ instances by taking the rows of the matrix $A$ from
   the previous question, and the rows of the matrices $2A, 3A, 4A, \ldots, \frac{T}{2} A$. Show
   that the resulting set is shattered by $L(B_d, T)$.

10.5 Efficiently Calculating the Viola and Jones Features Using an Integral Image: Let
$A$ be a $24 \times 24$ matrix representing an image. The integral image of $A$, denoted by
$I(A)$, is the matrix $B$ such that $B_{i,j} = \sum_{i' \le i,\, j' \le j} A_{i',j'}$.

- Show that $I(A)$ can be calculated from $A$ in time linear in the size of $A$.

- Show how every Viola and Jones feature can be calculated from $I(A)$ in a constant
  amount of time (that is, the runtime does not depend on the size of the
  rectangle defining the feature).
In the previous chapter we have described the AdaBoost algorithm and have shown
how the parameter T of AdaBoost controls the bias-complexity tradeoff. But how 
do we set T in practice? More generally, when approaching some practical problem, 
we usually can think of several algorithms that may yield a good solution, each of 
which might have several parameters. How can we choose the best algorithm for the 
particular problem at hand? And how do we set the algorithm’s parameters? This 
task is often called model selection. 

To illustrate the model selection task, consider the problem of learning a one 
dimensional regression function, $h : \mathbb{R} \to \mathbb{R}$. Suppose that we obtain a training set as
depicted in the figure. 


We can consider fitting a polynomial to the data, as described in Chapter 9. How- 
ever, we might be uncertain regarding which degree d would give the best results 
for our data set: A small degree may not fit the data well (i.e., it will have a large 
approximation error), whereas a high degree may lead to overfitting (i.e., it will have 
a large estimation error). In the following we depict the result of fitting a polyno- 
mial of degrees 2, 3, and 10. It is easy to see that the empirical risk decreases as we 
enlarge the degree. However, looking at the graphs, our intuition tells us that setting 
the degree to 3 may be better than setting it to 10. It follows that the empirical risk 
alone is not enough for model selection. 




[Figure: three panels showing the fitted polynomials of Degree 2, Degree 3, and Degree 10]


In this chapter we will present two approaches for model selection. The first 
approach is based on the Structural Risk Minimization (SRM) paradigm we have 
described and analyzed in Section 7.2. SRM is particularly useful when a learning
algorithm depends on a parameter that controls the bias-complexity tradeoff (such 
as the degree of the fitted polynomial in the preceding example or the parameter T 
in AdaBoost). The second approach relies on the concept of validation. The basic 
idea is to partition the training set into two sets. One is used for training each of the 
candidate models, and the second is used for deciding which of them yields the best 
results. 

In model selection tasks, we try to find the right balance between approxima- 
tion and estimation errors. More generally, if our learning algorithm fails to find 
a predictor with a small risk, it is important to understand whether we suffer from 
overfitting or underfitting. In Section 11.3 we discuss how this can be achieved. 


11.1 MODEL SELECTION USING SRM 


The SRM paradigm has been described and analyzed in Section 7.2. Here we show 
how SRM can be used for tuning the tradeoff between bias and complexity without 
deciding on a specific hypothesis class in advance. Consider a countable sequence 
of hypothesis classes $\mathcal{H}_1, \mathcal{H}_2, \mathcal{H}_3, \ldots$. For example, in the problem of polynomial
regression mentioned, we can take $\mathcal{H}_d$ to be the set of polynomials of degree at
most $d$. Another example is taking $\mathcal{H}_d$ to be the class $L(B, d)$ used by AdaBoost, as
described in the previous chapter.

We assume that for every $d$, the class $\mathcal{H}_d$ enjoys the uniform convergence property
(see Definition 4.3 in Chapter 4) with a sample complexity function of the
form

$$m^{UC}_{\mathcal{H}_d}(\epsilon, \delta) \le \frac{g(d) \log(1/\delta)}{\epsilon^2}, \qquad (11.1)$$

where $g : \mathbb{N} \to \mathbb{R}$ is some monotonically increasing function. For example, in the case
of binary classification problems, we can take $g(d)$ to be the VC-dimension of the
class $\mathcal{H}_d$ multiplied by a universal constant (the one appearing in the fundamental
theorem of learning; see Theorem 6.8). For the classes $L(B, d)$ used by AdaBoost,
the function $g$ will simply grow with $d$.

Recall that the SRM rule follows a “bound minimization” approach, where in
our case the bound is as follows: With probability of at least $1 - \delta$, for every $d \in \mathbb{N}$
and $h \in \mathcal{H}_d$,

$$L_{\mathcal{D}}(h) \le L_S(h) + \sqrt{\frac{g(d)\left(\log(1/\delta) + 2\log(d) + \log(\pi^2/6)\right)}{m}}. \qquad (11.2)$$

This bound, which follows directly from Theorem 7.4, shows that for every $d$ and
every $h \in \mathcal{H}_d$, the true risk is bounded by two terms: the empirical risk, $L_S(h)$, and
a complexity term that depends on $d$. The SRM rule will search for $d$ and $h \in \mathcal{H}_d$
that minimize the right-hand side of Equation (11.2).
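
The search over $d$ can be sketched in Python as below. The sketch assumes that the empirical risk of an ERM hypothesis has already been computed for each candidate $d$, and it takes the complexity term to be $\sqrt{g(d)(\log(1/\delta) + 2\log(d) + \log(\pi^2/6))/m}$; the function names, the dictionary input, and the choice $g(d) = d + 1$ in the usage note are hypothetical.

```python
import math

def srm_bound(emp_risk, g_d, d, m, delta):
    """Empirical risk plus the complexity term for the class with index d."""
    log_terms = math.log(1.0 / delta) + 2.0 * math.log(d) + math.log(math.pi ** 2 / 6.0)
    return emp_risk + math.sqrt(g_d * log_terms / m)

def srm_select(emp_risks, g, m, delta):
    """Return the index d whose ERM hypothesis minimizes the bound.
    emp_risks maps each candidate d to the empirical risk of the ERM output in H_d."""
    return min(emp_risks, key=lambda d: srm_bound(emp_risks[d], g(d), d, m, delta))
```

With the made-up empirical risks `{2: 0.30, 3: 0.05, 10: 0.01}`, `g = lambda d: d + 1`, `m = 2000`, and `delta = 0.05`, the rule selects degree 3: the degree-10 fit has the lowest empirical risk, but its complexity term outweighs the gain.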

Getting back to the example of polynomial regression described earlier, even 
though the empirical risk of the 10th degree polynomial is smaller than that of the 
3rd degree polynomial, we would still prefer the 3rd degree polynomial since its 
complexity (as reflected by the value of the function g(d)) is much smaller. 

While the SRM approach can be useful in some situations, in many practical 
cases the upper bound given in Equation (11.2) is pessimistic. In the next section we 
present a more practical approach. 


11.2 VALIDATION 


We would often like to get a better estimation of the true risk of the output predictor 
of a learning algorithm. So far we have derived bounds on the estimation error of 
a hypothesis class, which tell us that for all hypotheses in the class, the true risk 
is not very far from the empirical risk. However, these bounds might be loose and 
pessimistic, as they hold for all hypotheses and all possible data distributions. A 
more accurate estimation of the true risk can be obtained by using some of the 
training data as a validation set, over which one can evaluate the success of the
algorithm’s output predictor. This procedure is called validation. 

Naturally, a better estimation of the true risk is useful for model selection, as we 
will describe in Section 11.2.2. 


11.2.1 Hold Out Set 


The simplest way to estimate the true error of a predictor h is by sampling an addi- 
tional set of examples, independent of the training set, and using the empirical error 
on this validation set as our estimator. Formally, let $V = (x_1, y_1), \ldots, (x_{m_v}, y_{m_v})$ be a
set of fresh $m_v$ examples that are sampled according to $\mathcal{D}$ (independently of the $m$
examples of the training set $S$). Using Hoeffding's inequality (Lemma 4.5) we have
the following:


Theorem 11.1. Let $h$ be some predictor and assume that the loss function is in $[0, 1]$.
Then, for every $\delta \in (0,1)$, with probability of at least $1 - \delta$ over the choice of a
validation set $V$ of size $m_v$ we have

$$|L_V(h) - L_{\mathcal{D}}(h)| \le \sqrt{\frac{\log(2/\delta)}{2 m_v}}.$$
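
As a rough numerical illustration of Theorem 11.1, the Python sketch below computes the validation error of a predictor together with the width $\sqrt{\log(2/\delta)/(2 m_v)}$ of the resulting confidence interval. The function names and the representation of $V$ as a list of `(x, y)` pairs are assumptions of this sketch.

```python
import math

def validation_error(h, V, loss):
    """Average loss of predictor h on the validation set V, a list of (x, y) pairs."""
    return sum(loss(h(x), y) for x, y in V) / len(V)

def hoeffding_radius(m_v, delta):
    """Deviation bound sqrt(log(2/delta) / (2 m_v)) for a loss taking values in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m_v))

def zero_one_loss(prediction, label):
    return 0.0 if prediction == label else 1.0
```

For example, with $m_v = 1000$ and $\delta = 0.05$ the radius is roughly 0.043, so a validation error of 0.10 would place the true risk between about 0.057 and 0.143 with probability of at least 0.95.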
The bound in Theorem 11.1 does not depend on the algorithm or the training 
set used to construct $h$ and is tighter than the usual bounds that we have seen so far.
The reason for the tightness of this bound is that it is in terms of an estimate on a 
fresh validation set that is independent of the way h was generated. To illustrate this 
point, suppose that h was obtained by applying an ERM predictor with respect to a 
hypothesis class of VC-dimension d, over a training set of m examples. Then, from 
the fundamental theorem of learning (Theorem 6.8) we obtain the bound

\[ L_{\mathcal{D}}(h) \le L_S(h) + \sqrt{C \, \frac{d + \log(1/\delta)}{m}} , \]


where C is the constant appearing in Theorem 6.8. In contrast, from Theorem 11.1 
we obtain the bound 


\[ L_{\mathcal{D}}(h) \le L_V(h) + \sqrt{\frac{\log(2/\delta)}{2 m_v}} . \]

Therefore, taking $m_v$ to be of order $m$, we obtain an estimate that is more accurate
by a factor that depends on the VC-dimension. On the other hand, the price we 
pay for using such an estimate is that it requires an additional sample on top of the 
sample used for training the learner. 
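
To make the comparison concrete, the following minimal sketch evaluates both bounds numerically. The function names are illustrative, and the value `C = 1.0` is only a placeholder assumption, since the constant of Theorem 6.8 is not specified here.

```python
import math

def holdout_bound(m_v: int, delta: float) -> float:
    """Width of the Theorem 11.1 bound: sqrt(log(2/delta) / (2 * m_v))."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m_v))

def vc_bound(m: int, d: int, delta: float, C: float = 1.0) -> float:
    """Excess term sqrt(C * (d + log(1/delta)) / m) from Theorem 6.8.

    C = 1.0 is a placeholder; the true constant is not given in the text.
    """
    return math.sqrt(C * (d + math.log(1.0 / delta)) / m)

m = m_v = 10_000
delta = 0.05
for d in (10, 100, 1000):
    ratio = vc_bound(m, d, delta) / holdout_bound(m_v, delta)
    print(f"d={d:5d}  vc={vc_bound(m, d, delta):.4f}  "
          f"holdout={holdout_bound(m_v, delta):.4f}  ratio={ratio:.1f}")
```

With equal sample sizes, the ratio of the two terms grows roughly like the square root of the VC-dimension, which is the factor referred to above.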

Sampling a training set and then sampling an independent validation set is equiv- 
alent to randomly partitioning our random set of examples into two parts, using one 
part for training and the other one for validation. For this reason, the validation set 
is often referred to as a hold out set. 


11.2.2 Validation for Model Selection 


Validation can be naturally used for model selection as follows. We first train dif- 
ferent algorithms (or the same algorithm with different parameters) on the given 
training set. Let $\mathcal{H} = \{h_1, \ldots, h_r\}$ be the set of all output predictors of the different
algorithms. For example, in the case of training polynomial regressors, we would
have each $h_r$ be the output of polynomial regression of degree $r$. Now, to choose a
single predictor from $\mathcal{H}$ we sample a fresh validation set and choose the predictor
that minimizes the error over the validation set. In other words, we apply $\mathrm{ERM}_{\mathcal{H}}$
over the validation set.
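
The selection step can be sketched as follows. This is an illustration only: the candidate predictors, the squared loss, and the tiny validation set are hypothetical stand-ins for the outputs of real learning algorithms.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, float]          # an (x, y) pair
Predictor = Callable[[float], float]

def validation_error(h: Predictor, V: Sequence[Example]) -> float:
    """Average squared loss of h on the validation set V."""
    return sum((h(x) - y) ** 2 for x, y in V) / len(V)

def select_by_validation(predictors: List[Predictor], V: Sequence[Example]) -> int:
    """ERM over the finite class {h_1, ..., h_r} on V: index of the minimizer."""
    errors = [validation_error(h, V) for h in predictors]
    return min(range(len(predictors)), key=errors.__getitem__)

# Hypothetical candidates standing in for the outputs of different learners.
candidates = [lambda x: 0.0, lambda x: x, lambda x: 2.0 * x]
V = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.9)]
best = select_by_validation(candidates, V)
print("selected predictor index:", best)
```

The training set plays no role in this step; only the fresh validation set is used to compare the candidates.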

This process is very similar to learning a finite hypothesis class. The only difference
is that $\mathcal{H}$ is not fixed ahead of time but rather depends on the training set.
However, since the validation set is independent of the training set we get that
it is also independent of $\mathcal{H}$ and therefore the same technique we used to derive
bounds for finite hypothesis classes holds here as well. In particular, combining
Theorem 11.1 with the union bound we obtain:


Theorem 11.2. Let $\mathcal{H} = \{h_1, \ldots, h_r\}$ be an arbitrary set of predictors and assume that
the loss function is in $[0,1]$. Assume that a validation set $V$ of size $m_v$ is sampled
independent of $\mathcal{H}$. Then, with probability of at least $1 - \delta$ over the choice of $V$ we
have

\[ \forall h \in \mathcal{H}, \quad |L_{\mathcal{D}}(h) - L_V(h)| \le \sqrt{\frac{\log(2|\mathcal{H}|/\delta)}{2 m_v}} . \]

This theorem tells us that the error on the validation set approximates the true
error as long as $\mathcal{H}$ is not too large. However, if we try too many methods (resulting
in $|\mathcal{H}|$ that is large relative to the size of the validation set) then we're in danger of
overfitting.

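
As a rough numerical illustration of how the bound of Theorem 11.2 degrades with the number of candidates, the sketch below evaluates the bound and inverts it. The helper names are hypothetical, and the inversion simply solves the bound for the number of predictors.

```python
import math

def selection_bound(num_predictors: int, m_v: int, delta: float) -> float:
    """Theorem 11.2 width: sqrt(log(2 * |H| / delta) / (2 * m_v))."""
    return math.sqrt(math.log(2.0 * num_predictors / delta) / (2.0 * m_v))

def max_predictors(m_v: int, delta: float, eps: float) -> int:
    """Largest |H| for which the bound is at most eps.

    Solving sqrt(log(2|H|/delta) / (2 m_v)) <= eps for |H| gives
    |H| <= (delta / 2) * exp(2 * m_v * eps^2).
    """
    return int(math.floor(delta / 2.0 * math.exp(2.0 * m_v * eps ** 2)))

for r in (1, 10, 100, 1000):
    print(f"|H|={r:5d}  bound={selection_bound(r, 500, 0.05):.4f}")
print("max |H| for eps=0.1:", max_predictors(500, 0.05, 0.1))
```

The bound grows only logarithmically in the number of candidates, so a moderately sized validation set supports comparing many models, but not arbitrarily many.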
To illustrate how validation is useful for model selection, consider again the
example of fitting a one dimensional polynomial as described in the beginning of 
this chapter. In the following we depict the same training set, with ERM polynomi- 
als of degree 2, 3, and 10, but this time we also depict an additional validation set 
(marked as red, unfilled circles). The polynomial of degree 10 has minimal training 
error, yet the polynomial of degree 3 has the minimal validation error, and hence it 
will be chosen as the best model. 


11.2.3 The Model-Selection Curve 


The model selection curve shows the training error and validation error as a function 
of the complexity of the model considered. For example, for the polynomial fitting 
problem mentioned previously, the curve will look like: 


[Figure: model-selection curve showing the train error and the validation error as a function of the polynomial degree.]


As can be shown, the training error is monotonically decreasing as we increase the 
polynomial degree (which is the complexity of the model in our case). On the other 
hand, the validation error first decreases but then starts to increase, which indicates 
that we are starting to suffer from overfitting. 
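
A model-selection curve can be computed with a few lines of code. The sketch below is an assumption-laden stand-in for the polynomial example: to stay within the standard library it swaps polynomial regression for piecewise-constant (histogram) regressors, with the number of cells playing the role of the complexity parameter, and it uses a synthetic data source.

```python
import random

def fit_histogram(train, bins):
    """ERM for the squared loss over piecewise-constant functions
    on `bins` equal-width cells of [0, 1)."""
    sums, counts = [0.0] * bins, [0] * bins
    for x, y in train:
        cell = min(int(x * bins), bins - 1)
        sums[cell] += y
        counts[cell] += 1
    overall = sum(y for _, y in train) / len(train)
    # Empty cells fall back to the overall mean (an arbitrary choice).
    values = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: values[min(int(x * bins), bins - 1)]

def mean_squared_error(h, data):
    return sum((h(x) - y) ** 2 for x, y in data) / len(data)

def sample(n, rng):
    """Hypothetical data source: y = x^2 plus Gaussian noise."""
    return [(x, x * x + rng.gauss(0.0, 0.1))
            for x in (rng.random() for _ in range(n))]

rng = random.Random(0)
train, valid = sample(40, rng), sample(200, rng)
complexities = [1, 2, 4, 8, 16, 32]     # nested partitions of [0, 1)
curve = []
for bins in complexities:
    h = fit_histogram(train, bins)
    curve.append((bins, mean_squared_error(h, train), mean_squared_error(h, valid)))
for bins, tr, va in curve:
    print(f"bins={bins:2d}  train={tr:.4f}  validation={va:.4f}")
```

Because the partitions are nested, the training error cannot increase with the number of cells; the validation error carries no such guarantee, and that is the column to inspect when choosing the complexity.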

Plotting such curves can help us understand whether we are searching the correct 
regime of our parameter space. Often, there may be more than a single parameter 
to tune, and the possible number of values each parameter can take might be quite 
large. For example, in Chapter 13 we describe the concept of regularization, in which 
the parameter of the learning algorithm is a real number. In such cases, we start 
[...] understanding the exact behavior of cross validation is still an open problem. Rogers
and Wagner (Rogers & Wagner 1978) have shown that for k-local rules (e.g., k-Nearest
Neighbor; see Chapter 19) the cross validation procedure gives a very good
estimate of the true error. Other papers show that cross validation works for stable 
algorithms (we will study stability and its relation to learnability in Chapter 13). 


11.2.5 Train-Validation-Test Split


In most practical applications, we split the available examples into three sets. The 
first set is used for training our algorithm and the second is used as a validation set 
for model selection. After we select the best model, we test the performance of the 
output predictor on the third set, which is often called the “test set.” The number 
obtained is used as an estimator of the true error of the learned predictor. 
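
A three-way split can be sketched as below. The 60/20/20 proportions and the function name are illustrative assumptions; the text does not prescribe particular fractions.

```python
import random

def train_validation_test_split(examples, frac_train=0.6, frac_valid=0.2, seed=0):
    """Randomly partition `examples` into three disjoint lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(frac_train * len(shuffled))
    n_valid = round(frac_valid * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]      # everything that remains
    return train, valid, test

train, valid, test = train_validation_test_split(range(100))
print(len(train), len(valid), len(test))
```

The test set should be touched only once, after model selection is finished; otherwise its error stops being an unbiased estimate of the true error.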


11.3 WHAT TO DO IF LEARNING FAILS 


Consider the following scenario: You were given a learning task and have 
approached it with a choice of a hypothesis class, a learning algorithm, and param- 
eters. You used a validation set to tune the parameters and tested the learned 
predictor on a test set. The test results, unfortunately, turn out to be unsatisfactory. 
What went wrong then, and what should you do next? 

There are many elements that can be “fixed.” The main approaches are listed in 
the following: 


■ Get a larger sample
■ Change the hypothesis class by
   - Enlarging it
   - Reducing it
   - Completely changing it
   - Changing the parameters you consider
■ Change the feature representation of the data
■ Change the optimization algorithm used to apply your learning rule


In order to find the best remedy, it is essential first to understand the cause of
the bad performance. Recall that in Chapter 5 we decomposed the true error of
the learned predictor into approximation error and estimation error. The approximation
error is defined to be $L_{\mathcal{D}}(h^*)$ for some $h^* \in \operatorname{argmin}_{h \in \mathcal{H}} L_{\mathcal{D}}(h)$, while the
estimation error is defined to be $L_{\mathcal{D}}(h_S) - L_{\mathcal{D}}(h^*)$, where $h_S$ is the learned predictor
(which is based on the training set $S$).

The approximation error of the class does not depend on the sample size or on 
the algorithm being used. It only depends on the distribution $\mathcal{D}$ and on the hypothesis
class $\mathcal{H}$. Therefore, if the approximation error is large, it will not help us to
enlarge the training set size, and it also does not make sense to reduce the hypoth- 
esis class. What can be beneficial in this case is to enlarge the hypothesis class or 
completely change it (if we have some alternative prior knowledge in the form of a 
different hypothesis class). We can also consider applying the same hypothesis class 
but on a different feature representation of the data (see Chapter 25). 
The estimation error of the class does depend on the sample size. Therefore,
if we have a large estimation error we can make an effort to obtain more training 
examples. We can also consider reducing the hypothesis class. However, it doesn’t 
make sense to enlarge the hypothesis class in that case. 


Error Decomposition Using Validation 

We see that understanding whether our problem is due to approximation error or 
estimation error is very useful for finding the best remedy. In the previous section we 
saw how to estimate $L_{\mathcal{D}}(h_S)$ using the empirical risk on a validation set. However,
it is more difficult to estimate the approximation error of the class. Instead, we 
give a different error decomposition, one that can be estimated from the train and 
validation sets. 


\[ L_{\mathcal{D}}(h_S) = (L_{\mathcal{D}}(h_S) - L_V(h_S)) + (L_V(h_S) - L_S(h_S)) + L_S(h_S) . \]


The first term, $(L_{\mathcal{D}}(h_S) - L_V(h_S))$, can be bounded quite tightly using Theorem 11.1.
Intuitively, when the second term, $(L_V(h_S) - L_S(h_S))$, is large we say that our algorithm
suffers from “overfitting” while when the empirical risk term, $L_S(h_S)$, is large
we say that our algorithm suffers from “underfitting.” Note that these two terms
are not necessarily good estimates of the estimation and approximation errors. To
illustrate this, consider the case in which $\mathcal{H}$ is a class of VC-dimension $d$, and $\mathcal{D}$ is
a distribution such that the approximation error of $\mathcal{H}$ with respect to $\mathcal{D}$ is $1/4$. As
long as the size of our training set is smaller than $d$ we will have $L_S(h_S) = 0$ for every
ERM hypothesis. Therefore, the training risk, $L_S(h_S)$, and the approximation error,
$L_{\mathcal{D}}(h^*)$, can be significantly different. Nevertheless, as we show later, the values of
$L_S(h_S)$ and $(L_V(h_S) - L_S(h_S))$ still provide us useful information.
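
The two observable terms of this decomposition are easy to compute in practice. The following sketch uses the 0-1 loss and a deliberately bad, hypothetical predictor that memorizes its training set; all names are illustrative.

```python
def zero_one_error(h, data):
    """Fraction of examples (x, y) in `data` that h misclassifies."""
    return sum(1 for x, y in data if h(x) != y) / len(data)

def decompose_error(h, train, valid):
    """Observable terms: the training error L_S(h) and the gap L_V(h) - L_S(h)."""
    l_s = zero_one_error(h, train)
    l_v = zero_one_error(h, valid)
    return {"train_error": l_s, "gap": l_v - l_s, "validation_error": l_v}

# Hypothetical predictor: memorizes the training set, otherwise predicts 0.
train = [(0, 0), (1, 1), (2, 0), (3, 1)]
valid = [(4, 1), (5, 0), (6, 1), (7, 1)]
memory = dict(train)
h = lambda x: memory.get(x, 0)
report = decompose_error(h, train, valid)
print(report)
```

Here the training error is zero and the whole validation error sits in the gap term, the pattern described above as overfitting.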

Consider first the case in which $L_S(h_S)$ is large. We can write

\[ L_S(h_S) = (L_S(h_S) - L_S(h^*)) + (L_S(h^*) - L_{\mathcal{D}}(h^*)) + L_{\mathcal{D}}(h^*) . \]

When $h_S$ is an $\mathrm{ERM}_{\mathcal{H}}$ hypothesis we have that $L_S(h_S) - L_S(h^*) \le 0$. In addition,
since $h^*$ does not depend on $S$, the term $(L_S(h^*) - L_{\mathcal{D}}(h^*))$ can be bounded quite
tightly (as in Theorem 11.1). The last term is the approximation error. It follows that
if $L_S(h_S)$ is large then so is the approximation error, and the remedy to the failure
of our algorithm should be tailored accordingly (as discussed previously).


Remark 11.1. It is possible that the approximation error of our class is small, yet
the value of $L_S(h_S)$ is large. For example, maybe we had a bug in our ERM implementation,
and the algorithm returns a hypothesis $h_S$ that is not an ERM. It may
also be the case that finding an ERM hypothesis is computationally hard, and our
algorithm applies some heuristic trying to find an approximate ERM. In some cases,
it is hard to know how good $h_S$ is relative to an ERM hypothesis. But, sometimes it
is possible at least to know whether there are better hypotheses. For example, in the 
next chapter we will study convex learning problems in which there are optimality 
conditions that can be checked to verify whether our optimization algorithm con- 
verged to an ERM solution. In other cases, the solution may depend on randomness 
in initializing the algorithm, so we can try different randomly selected initial points 
to see whether better solutions pop out. 




[Figure: two panels (left and right), each plotting the validation error and the train error against the training set size m.]


Figure 11.1. Examples of learning curves. Left: This learning curve corresponds to the 
scenario in which the number of examples is always smaller than the VC dimension of the 
class. Right: This learning curve corresponds to the scenario in which the approximation 
error is zero and the number of examples is larger than the VC dimension of the class. 


Next consider the case in which $L_S(h_S)$ is small. As we argued before, this does
not necessarily imply that the approximation error is small. Indeed, consider two
scenarios, in both of which we are trying to learn a hypothesis class of VC-dimension
$d$ using the ERM learning rule. In the first scenario, we have a training set of $m < d$
examples and the approximation error of the class is high. In the second scenario,
we have a training set of $m > 2d$ examples and the approximation error of the class
is zero. In both cases $L_S(h_S) = 0$. How can we distinguish between the two cases?


Learning Curves 

One possible way to distinguish between the two cases is by plotting learning curves. 
To produce a learning curve we train the algorithm on prefixes of the data of increas- 
ing sizes. For example, we can first train the algorithm on the first 10% of the 
examples, then on 20% of them, and so on. For each prefix we calculate the training 
error (on the prefix the algorithm is being trained on) and the validation error (on 
a predefined validation set). Such learning curves can help us distinguish between 
the two aforementioned scenarios. In the first scenario we expect the validation 
error to be approximately 1/2 for all prefixes, as we didn’t really learn anything. 
In the second scenario the validation error will start as a constant but then should 
start decreasing (it must start decreasing once the training set size is larger than the 
VC-dimension). An illustration of the two cases is given in Figure 11.1. 
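
The prefix-based procedure can be sketched as follows. The learner (brute-force ERM over one-dimensional threshold classifiers) and the realizable synthetic distribution are assumptions chosen only to keep the example self-contained.

```python
import random

def erm_threshold(train):
    """ERM over thresholds h_t(x) = 1 if x >= t else 0 (brute force)."""
    candidates = sorted(x for x, _ in train) + [float("inf")]
    def errors(t):
        return sum(1 for x, y in train if (1 if x >= t else 0) != y)
    best_t = min(candidates, key=errors)
    return lambda x: 1 if x >= best_t else 0

def zero_one_error(h, data):
    return sum(1 for x, y in data if h(x) != y) / len(data)

def learning_curve(learner, train, valid, fractions):
    """Train on growing prefixes of `train`; return a list of
    (prefix size, train error on the prefix, validation error)."""
    curve = []
    for frac in fractions:
        prefix = train[:max(1, round(frac * len(train)))]
        h = learner(prefix)
        curve.append((len(prefix), zero_one_error(h, prefix), zero_one_error(h, valid)))
    return curve

rng = random.Random(0)

def sample(n):
    # Hypothetical realizable distribution: the label is 1 exactly when x >= 0.5.
    return [(x, 1 if x >= 0.5 else 0) for x in (rng.random() for _ in range(n))]

train, valid = sample(200), sample(1000)
curve = learning_curve(erm_threshold, train, valid, [0.1 * k for k in range(1, 11)])
for size, tr, va in curve:
    print(f"m={size:3d}  train={tr:.3f}  validation={va:.3f}")
```

The validation set is fixed in advance and shared by all prefixes, so the points on the curve are comparable with one another.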

In general, as long as the approximation error is greater than zero we expect the 
training error to grow with the sample size, as a larger amount of data points makes 
it harder to provide an explanation for all of them. On the other hand, the validation 
error tends to decrease with the increase in sample size. If the VC-dimension is 
finite, when the sample size goes to infinity, the validation and train errors converge 
to the approximation error. Therefore, by extrapolating the training and validation 
curves we can try to guess the value of the approximation error, or at least to get a 
rough estimate on an interval in which the approximation error resides. 

Getting back to the problem of finding the best remedy for the failure of our
algorithm, if we observe that $L_S(h_S)$ is small while the validation error is large, then
in any case we know that the size of our training set is not sufficient for learning
the class $\mathcal{H}$. We can then plot a learning curve. If we see that the validation error is
starting to decrease then the best solution is to increase the number of examples (if
we can afford to enlarge the data). Another reasonable solution is to decrease the
complexity of the hypothesis class. On the other hand, if we see that the validation
error is kept around 1/2 then we have no evidence that the approximation error of
$\mathcal{H}$ is good. It may be the case that increasing the training set size will not help us at
all. Obtaining more data can still help us, as at some point we can see whether the
validation error starts to decrease or whether the training error starts to increase.
But, if more data is expensive, it may be better first to try to reduce the complexity
of the hypothesis class.

To summarize the discussion, the following steps should be applied: 


1. If learning involves parameter tuning, plot the model-selection curve to make 
sure that you tuned the parameters appropriately (see Section 11.2.3). 

2. If the training error is excessively large, consider enlarging the hypothesis
class, completely changing it, or changing the feature representation of the data.

3. If the training error is small, plot learning curves and try to deduce from them 
whether the problem is estimation error or approximation error. 

4. If the approximation error seems to be small enough, try to obtain more data. 
If this is not possible, consider reducing the complexity of the hypothesis class. 

5. If the approximation error seems to be large as well, try to change the 
hypothesis class or the feature representation of the data completely. 


11.4 SUMMARY 


Model selection is the task of selecting an appropriate model for the learning task 
based on the data itself. We have shown how this can be done using the SRM learn- 
ing paradigm or using the more practical approach of validation. If our learning 
algorithm fails, a decomposition of the algorithm’s error should be performed using 
learning curves, so as to find the best remedy. 


11.5 EXERCISES 


11.1 Failure of k-fold cross validation. Consider a case in which the label is chosen at random
according to $P[y = 1] = P[y = 0] = 1/2$. Consider a learning algorithm that
outputs the constant predictor $h(x) = 1$ if the parity of the labels on the training set
is 1 and otherwise the algorithm outputs the constant predictor $h(x) = 0$. Prove that
the difference between the leave-one-out estimate and the true error in such a case
is always 1/2.


11.2 Let $\mathcal{H}_1, \ldots, \mathcal{H}_k$ be $k$ hypothesis classes. Suppose you are given $m$ i.i.d. training examples
and you would like to learn the class $\mathcal{H} = \bigcup_{i=1}^{k} \mathcal{H}_i$. Consider two alternative
approaches:

■ Learn $\mathcal{H}$ on the $m$ examples using the ERM rule.
■ Divide the $m$ examples into a training set of size $(1 - \alpha)m$ and a validation
set of size $\alpha m$, for some $\alpha \in (0,1)$. Then, apply the approach of model selection
using validation. That is, first train each class $\mathcal{H}_i$ on the $(1 - \alpha)m$ training
examples using the ERM rule with respect to $\mathcal{H}_i$, and let $\hat{h}_1, \ldots, \hat{h}_k$ be the resulting
hypotheses. Second, apply the ERM rule with respect to the finite class
$\{\hat{h}_1, \ldots, \hat{h}_k\}$ on the $\alpha m$ validation examples.

Describe scenarios in which the first method is better than the second and vice
versa.
In this chapter we introduce convex learning problems. Convex learning comprises
an important family of learning problems, mainly because most of what we can 
learn efficiently falls into it. We have already encountered linear regression with 
the squared loss and logistic regression, which are convex problems, and indeed 
they can be learned efficiently. We have also seen nonconvex problems, such as 
halfspaces with the 0-1 loss, which is known to be computationally hard to learn in 
the unrealizable case. 

In general, a convex learning problem is a problem whose hypothesis class is a 
convex set, and whose loss function is a convex function for each example. We begin 
the chapter with some required definitions of convexity. Besides convexity, we will 
define Lipschitzness and smoothness, which are additional properties of the loss 
function that facilitate successful learning. We next turn to defining convex learning 
problems and demonstrate the necessity for further constraints such as Bounded- 
ness and Lipschitzness or Smoothness. We define these more restricted families of 
learning problems and claim that Convex-Smooth/Lipschitz-Bounded problems are 
learnable. These claims will be proven in the next two chapters, in which we will 
present two learning paradigms that successfully learn all problems that are either 
convex-Lipschitz-bounded or convex-smooth-bounded. 

Finally, in Section 12.3, we show how one can handle some nonconvex problems 
by minimizing “surrogate” loss functions that are convex (instead of the original 
nonconvex loss function). Surrogate convex loss functions give rise to efficient 
solutions but might increase the risk of the learned predictor. 


12.1 CONVEXITY, LIPSCHITZNESS, AND SMOOTHNESS


12.1.1 Convexity 


Definition 12.1 (Convex Set). A set $C$ in a vector space is convex if for any two
vectors $u, v$ in $C$, the line segment between $u$ and $v$ is contained in $C$. That is, for any
$\alpha \in [0,1]$ we have that $\alpha u + (1 - \alpha) v \in C$.
Examples of convex and nonconvex sets in $\mathbb{R}^2$ are given in the following. For the
nonconvex sets, we depict two points in the set such that the line between the two
points is not contained in the set.

[Figure: examples of nonconvex sets and of convex sets.]


Given $\alpha \in [0,1]$, the combination, $\alpha u + (1 - \alpha) v$, of the points $u, v$ is called a
convex combination.


Definition 12.2 (Convex Function). Let $C$ be a convex set. A function $f : C \to \mathbb{R}$ is
convex if for every $u, v \in C$ and $\alpha \in [0, 1]$,

\[ f(\alpha u + (1 - \alpha) v) \le \alpha f(u) + (1 - \alpha) f(v) . \]

In words, $f$ is convex if for any $u, v$, the graph of $f$ between $u$ and $v$ lies below the
line segment joining $f(u)$ and $f(v)$. An illustration of a convex function, $f : \mathbb{R} \to \mathbb{R}$,
is depicted in the following.


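[Figure: graph of a convex function $f$ between $u$ and $v$; at the point $\alpha u + (1 - \alpha) v$ the chord value $\alpha f(u) + (1 - \alpha) f(v)$ lies above $f(\alpha u + (1 - \alpha) v)$.]
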
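
The inequality of Definition 12.2 can be probed numerically. The sketch below searches a finite grid for a violating triple; the function name and the grids are assumptions, and a clean result on a grid is evidence rather than a proof of convexity.

```python
import itertools
import math

def violates_convexity(f, points, alphas, tol=1e-9):
    """Return a triple (u, v, alpha) violating
    f(alpha*u + (1-alpha)*v) <= alpha*f(u) + (1-alpha)*f(v), or None.

    Finding no violation on a finite grid does not prove convexity.
    """
    for u, v in itertools.product(points, repeat=2):
        for a in alphas:
            lhs = f(a * u + (1 - a) * v)
            rhs = a * f(u) + (1 - a) * f(v)
            if lhs > rhs + tol:
                return (u, v, a)
    return None

grid = [k / 4 for k in range(-12, 13)]     # points in [-3, 3]
alphas = [k / 10 for k in range(11)]       # alpha in [0, 1]

square = lambda x: x * x
softplus = lambda x: math.log(1 + math.exp(x))

print("x^2:", violates_convexity(square, grid, alphas))
print("log(1+exp(x)):", violates_convexity(softplus, grid, alphas))
print("sin(x):", violates_convexity(math.sin, grid, alphas))
```

A returned triple is a certificate of nonconvexity, which is the reliable direction of this test.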
The epigraph of a function f is the set 


\[ \mathrm{epigraph}(f) = \{ (x, \beta) : f(x) \le \beta \} . \tag{12.1} \]


It is easy to verify that a function $f$ is convex if and only if its epigraph is a convex
set. An illustration of a nonconvex function $f : \mathbb{R} \to \mathbb{R}$, along with its epigraph, is
given in the following.

[Figure: a nonconvex function $f(x)$ and its epigraph.]


An important property of convex functions is that every local minimum of the
function is also a global minimum. Formally, let $B(u,r) = \{v : \|v - u\| \le r\}$ be a
ball of radius $r$ centered around $u$. We say that $f(u)$ is a local minimum of $f$ at
$u$ if there exists some $r > 0$ such that for all $v \in B(u,r)$ we have $f(v) \ge f(u)$. It
follows that for any $v$ (not necessarily in $B$), there is a small enough $\alpha > 0$ such that
$u + \alpha(v - u) \in B(u,r)$ and therefore

\[ f(u) \le f(u + \alpha(v - u)) . \tag{12.2} \]

If $f$ is convex, we also have that

\[ f(u + \alpha(v - u)) = f(\alpha v + (1 - \alpha) u) \le (1 - \alpha) f(u) + \alpha f(v) . \tag{12.3} \]

Combining these two equations and rearranging terms, we conclude that $f(u) \le f(v)$.
Since this holds for every $v$, it follows that $f(u)$ is also a global minimum of $f$.
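
For completeness, the rearrangement step can be spelled out as follows, using only Equations (12.2) and (12.3) and the fact that $\alpha > 0$:

```latex
\begin{align*}
f(u) &\le f(u + \alpha(v - u))           && \text{by (12.2)} \\
     &\le (1 - \alpha) f(u) + \alpha f(v) && \text{by (12.3)} \\
\Longrightarrow \quad \alpha f(u) &\le \alpha f(v)
                                          && \text{subtracting } (1 - \alpha) f(u) \\
\Longrightarrow \quad f(u) &\le f(v)      && \text{dividing by } \alpha > 0 .
\end{align*}
```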

Another important property of convex functions is that for every $w$ we can
construct a tangent to $f$ at $w$ that lies below $f$ everywhere. If $f$ is differentiable,
this tangent is the linear function $l(u) = f(w) + \langle \nabla f(w), u - w \rangle$, where
$\nabla f(w)$ is the gradient of $f$ at $w$, namely, the vector of partial derivatives of $f$,
$\nabla f(w) = \left( \frac{\partial f(w)}{\partial w_1}, \ldots, \frac{\partial f(w)}{\partial w_d} \right)$. That is, for convex differentiable functions,

\[ \forall u, \quad f(u) \ge f(w) + \langle \nabla f(w), u - w \rangle . \tag{12.4} \]


In Chapter 14 we will generalize this inequality to nondifferentiable functions. An 
illustration of Equation (12.4) is given in the following. 


If f is a scalar differentiable function, there is an easy way to check whether it is 
convex. 


Lemma 12.3. Let $f : \mathbb{R} \to \mathbb{R}$ be a scalar twice differentiable function, and let $f', f''$ be
its first and second derivatives, respectively. Then, the following are equivalent:
1. $f$ is convex
2. $f'$ is monotonically nondecreasing
3. $f''$ is nonnegative


Example 12.1.
■ The scalar function $f(x) = x^2$ is convex. To see this, note that $f'(x) = 2x$ and
$f''(x) = 2 > 0$.
■ The scalar function $f(x) = \log(1 + \exp(x))$ is convex. To see this, observe that
$f'(x) = \frac{\exp(x)}{1 + \exp(x)} = \frac{1}{\exp(-x) + 1}$. This is a monotonically increasing function since
the exponential function is a monotonically increasing function.


The following claim shows that the composition of a convex scalar function with
a linear function yields a convex function of a vector argument.


Claim 12.4. Assume that $f : \mathbb{R}^d \to \mathbb{R}$ can be written as $f(w) = g(\langle w, x \rangle + y)$, for some
$x \in \mathbb{R}^d$, $y \in \mathbb{R}$, and $g : \mathbb{R} \to \mathbb{R}$. Then, convexity of $g$ implies the convexity of $f$.


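Proof. Let $w_1, w_2 \in \mathbb{R}^d$ and $\alpha \in [0, 1]$. We have

\begin{align*}
f(\alpha w_1 + (1 - \alpha) w_2) &= g(\langle \alpha w_1 + (1 - \alpha) w_2, x \rangle + y) \\
&= g(\alpha \langle w_1, x \rangle + (1 - \alpha) \langle w_2, x \rangle + y) \\
&= g(\alpha (\langle w_1, x \rangle + y) + (1 - \alpha)(\langle w_2, x \rangle + y)) \\
&\le \alpha g(\langle w_1, x \rangle + y) + (1 - \alpha) g(\langle w_2, x \rangle + y) ,
\end{align*}

where the last inequality follows from the convexity of $g$. □

Example 12.2.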


■ Given some $x \in \mathbb{R}^d$ and $y \in \mathbb{R}$, let $f : \mathbb{R}^d \to \mathbb{R}$ be defined as $f(w) = (\langle w, x \rangle - y)^2$.
Then, $f$ is a composition of the function $g(a) = a^2$ onto a linear function, and
hence $f$ is a convex function.

■ Given some $x \in \mathbb{R}^d$ and $y \in \{\pm 1\}$, let $f : \mathbb{R}^d \to \mathbb{R}$ be defined as
$f(w) = \log(1 + \exp(-y \langle w, x \rangle))$. Then, $f$ is a composition of the function $g(a) = \log(1 + \exp(a))$
onto a linear function, and hence $f$ is a convex function.


Finally, the following claim shows that the maximum of convex functions is
convex and that a weighted sum of convex functions, with nonnegative weights, is
also convex.


Claim 12.5. For $i = 1, \ldots, r$, let $f_i : \mathbb{R}^d \to \mathbb{R}$ be a convex function. The following
functions from $\mathbb{R}^d$ to $\mathbb{R}$ are also convex.

■ $g(x) = \max_{i \in [r]} f_i(x)$
■ $g(x) = \sum_{i=1}^{r} w_i f_i(x)$, where for all $i$, $w_i \ge 0$.

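Proof. The first claim follows by

\begin{align*}
g(\alpha u + (1 - \alpha) v) &= \max_i f_i(\alpha u + (1 - \alpha) v) \\
&\le \max_i \left[ \alpha f_i(u) + (1 - \alpha) f_i(v) \right] \\
&\le \alpha \max_i f_i(u) + (1 - \alpha) \max_i f_i(v) \\
&= \alpha g(u) + (1 - \alpha) g(v) .
\end{align*}

For the second claim

\begin{align*}
g(\alpha u + (1 - \alpha) v) &= \sum_i w_i f_i(\alpha u + (1 - \alpha) v) \\
&\le \sum_i w_i \left[ \alpha f_i(u) + (1 - \alpha) f_i(v) \right] \\
&= \alpha \sum_i w_i f_i(u) + (1 - \alpha) \sum_i w_i f_i(v) \\
&= \alpha g(u) + (1 - \alpha) g(v) .
\end{align*}
□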


Example 12.3. The function $g(x) = |x|$ is convex. To see this, note that $g(x) =
\max\{x, -x\}$ and that both the function $f_1(x) = x$ and $f_2(x) = -x$ are convex.


12.1.2 Lipschitzness 


The definition of Lipschitzness that follows is with respect to the Euclidean norm
over $\mathbb{R}^d$. However, it is possible to define Lipschitzness with respect to any norm.


Definition 12.6 (Lipschitzness). Let $C \subset \mathbb{R}^d$. A function $f : \mathbb{R}^d \to \mathbb{R}^k$ is $\rho$-Lipschitz
over $C$ if for every $w_1, w_2 \in C$ we have that $\|f(w_1) - f(w_2)\| \le \rho \|w_1 - w_2\|$.


Intuitively, a Lipschitz function cannot change too fast. Note that if $f : \mathbb{R} \to \mathbb{R}$ is
differentiable, then by the mean value theorem we have

\[ f(w_1) - f(w_2) = f'(u)(w_1 - w_2) , \]

where $u$ is some point between $w_1$ and $w_2$. It follows that if the derivative of $f$ is
everywhere bounded (in absolute value) by $\rho$, then the function is $\rho$-Lipschitz.


Example 12.4.

■ The function $f(x) = |x|$ is 1-Lipschitz over $\mathbb{R}$. This follows from the triangle
inequality: For every $x_1, x_2$,

\[ |x_1| - |x_2| = |x_1 - x_2 + x_2| - |x_2| \le |x_1 - x_2| + |x_2| - |x_2| = |x_1 - x_2| . \]

Since this holds for both $x_1, x_2$ and $x_2, x_1$, we obtain that $\big| |x_1| - |x_2| \big| \le |x_1 - x_2|$.

■ The function $f(x) = \log(1 + \exp(x))$ is 1-Lipschitz over $\mathbb{R}$. To see this, observe
that

\[ |f'(x)| = \left| \frac{\exp(x)}{1 + \exp(x)} \right| = \left| \frac{1}{\exp(-x) + 1} \right| \le 1 . \]

■ The function $f(x) = x^2$ is not $\rho$-Lipschitz over $\mathbb{R}$ for any $\rho$. To see this, take
$x_1 = 0$ and $x_2 = 1 + \rho$, then

\[ f(x_2) - f(x_1) = (1 + \rho)^2 > \rho (1 + \rho) = \rho |x_2 - x_1| . \]

However, this function is $\rho$-Lipschitz over the set $C = \{x : |x| \le \rho/2\}$. Indeed, for
any $x_1, x_2 \in C$ we have

\[ |x_1^2 - x_2^2| = |x_1 + x_2| \, |x_1 - x_2| \le 2(\rho/2) |x_1 - x_2| = \rho |x_1 - x_2| . \]

■ The linear function $f : \mathbb{R}^d \to \mathbb{R}$ defined by $f(w) = \langle v, w \rangle + b$ where $v \in \mathbb{R}^d$ is
$\|v\|$-Lipschitz. Indeed, using the Cauchy-Schwarz inequality,

\[ |f(w_1) - f(w_2)| = |\langle v, w_1 - w_2 \rangle| \le \|v\| \, \|w_1 - w_2\| . \]

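
The scalar cases in Example 12.4 can be checked empirically. In the sketch below, the helper name, the grid, and the choice $\rho = 4$ are assumptions; the quantity computed is only a lower bound on the true Lipschitz constant over the sampled points.

```python
import itertools
import math

def empirical_lipschitz(f, points):
    """Largest slope |f(a) - f(b)| / |a - b| over pairs of distinct sample points.

    This is a lower bound on the Lipschitz constant of f over the sampled set.
    """
    return max(abs(f(a) - f(b)) / abs(a - b)
               for a, b in itertools.combinations(points, 2))

points = [k / 10 for k in range(-50, 51)]            # grid on [-5, 5]
softplus = lambda x: math.log(1 + math.exp(x))
square = lambda x: x * x
rho = 4.0
inside_C = [p for p in points if abs(p) <= rho / 2]  # C = {x : |x| <= rho/2}

print("|x|          :", empirical_lipschitz(abs, points))
print("log(1+exp(x)):", empirical_lipschitz(softplus, points))
print("x^2 on C     :", empirical_lipschitz(square, inside_C))
print("x^2 on [-5,5]:", empirical_lipschitz(square, points))
```

Restricting $x^2$ to the set $C$ keeps the observed slopes below $\rho$, while on the larger interval the slopes exceed it, in line with the third item of the example.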
The following claim shows that composition of Lipschitz functions preserves
Lipschitzness.


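Claim 12.7. Let $f(x) = g_1(g_2(x))$, where $g_1$ is $\rho_1$-Lipschitz and $g_2$ is $\rho_2$-Lipschitz.
Then, $f$ is $(\rho_1 \rho_2)$-Lipschitz. In particular, if $g_2$ is the linear function, $g_2(x) = \langle v, x \rangle + b$,
for some $v \in \mathbb{R}^d$, $b \in \mathbb{R}$, then $f$ is $(\rho_1 \|v\|)$-Lipschitz.


Proof.

\begin{align*}
|f(w_1) - f(w_2)| &= |g_1(g_2(w_1)) - g_1(g_2(w_2))| \\
&\le \rho_1 \|g_2(w_1) - g_2(w_2)\| \\
&\le \rho_1 \rho_2 \|w_1 - w_2\| .
\end{align*}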


12.1.3 Smoothness 


The definition of a smooth function relies on the notion of gradient. Recall that the
gradient of a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ at $w$, denoted $\nabla f(w)$, is the vector
of partial derivatives of $f$, namely, $\nabla f(w) = \left( \frac{\partial f(w)}{\partial w_1}, \ldots, \frac{\partial f(w)}{\partial w_d} \right)$.

Definition 12.8 (Smoothness). A differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is $\beta$-smooth if
its gradient is $\beta$-Lipschitz; namely, for all $v, w$ we have $\|\nabla f(v) - \nabla f(w)\| \le \beta \|v - w\|$.


It is possible to show that smoothness implies that for all $v, w$ we have

\[ f(v) \le f(w) + \langle \nabla f(w), v - w \rangle + \frac{\beta}{2} \|v - w\|^2 . \tag{12.5} \]

Recall that convexity of $f$ implies that $f(v) \ge f(w) + \langle \nabla f(w), v - w \rangle$. Therefore,
when a function is both convex and smooth, we have both upper and lower bounds
on the difference between the function and its first order approximation.

Setting $v = w - \frac{1}{\beta} \nabla f(w)$ in the right-hand side of Equation (12.5) and rearranging
terms, we obtain

\[ \frac{1}{2\beta} \|\nabla f(w)\|^2 \le f(w) - f(v) . \]
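
The substitution can be carried out explicitly. With $v = w - \frac{1}{\beta} \nabla f(w)$ we have $v - w = -\frac{1}{\beta} \nabla f(w)$, so Equation (12.5) gives:

```latex
\begin{align*}
f(v) &\le f(w) + \Big\langle \nabla f(w), -\tfrac{1}{\beta} \nabla f(w) \Big\rangle
          + \frac{\beta}{2} \Big\| \tfrac{1}{\beta} \nabla f(w) \Big\|^2 \\
     &= f(w) - \frac{1}{\beta} \|\nabla f(w)\|^2 + \frac{1}{2\beta} \|\nabla f(w)\|^2 \\
     &= f(w) - \frac{1}{2\beta} \|\nabla f(w)\|^2 ,
\end{align*}
```

and moving the gradient term to the left-hand side yields $\frac{1}{2\beta} \|\nabla f(w)\|^2 \le f(w) - f(v)$.
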
If we further assume that $f(v) \ge 0$ for all $v$ we conclude that smoothness implies the
following:

\[ \|\nabla f(w)\|^2 \le 2 \beta f(w) . \tag{12.6} \]

A function that satisfies this property is also called a self-bounded function.


Example 12.5.

■ The function $f(x) = x^2$ is 2-smooth. This follows directly from the fact
that $f'(x) = 2x$. Note that for this particular function Equation (12.5) and
Equation (12.6) hold with equality.

■ The function $f(x) = \log(1 + \exp(x))$ is $(1/4)$-smooth. Indeed, since $f'(x) = \frac{1}{1 + \exp(-x)}$
we have that

\[ |f''(x)| = \frac{\exp(-x)}{(1 + \exp(-x))^2} = \frac{1}{(1 + \exp(-x))(1 + \exp(x))} \le 1/4 . \]

Hence, $f'$ is $(1/4)$-Lipschitz. Since this function is nonnegative, Equation (12.6)
holds as well.
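
The second item of Example 12.5 can be checked numerically on a grid. The sketch below tests Equation (12.6) and the inequality obtained from Equation (12.5) with $v = x - f'(x)/\beta$; the grid and the tolerance are arbitrary assumptions.

```python
import math

def softplus(x):
    return math.log(1 + math.exp(x))

def softplus_grad(x):
    return 1 / (1 + math.exp(-x))

beta = 0.25
grid = [k / 10 for k in range(-100, 101)]   # x in [-10, 10]

# Equation (12.6): f'(x)^2 <= 2 * beta * f(x)
self_bounded = all(
    softplus_grad(x) ** 2 <= 2 * beta * softplus(x) + 1e-12 for x in grid
)

# Equation (12.5) with v = x - f'(x)/beta: f(v) <= f(x) - f'(x)^2 / (2 * beta)
descent = all(
    softplus(x - softplus_grad(x) / beta)
    <= softplus(x) - softplus_grad(x) ** 2 / (2 * beta) + 1e-12
    for x in grid
)

print("self-bounded on grid:", self_bounded)
print("descent inequality on grid:", descent)
```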


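The following claim shows that a composition of a smooth scalar function over a
linear function preserves smoothness.


Claim 12.9. Let $f(w) = g(\langle w, x \rangle + b)$, where $g : \mathbb{R} \to \mathbb{R}$ is a $\beta$-smooth function, $x \in \mathbb{R}^d$,
and $b \in \mathbb{R}$. Then, $f$ is $(\beta \|x\|^2)$-smooth.


Proof. By the chain rule we have that $\nabla f(w) = g'(\langle w, x \rangle + b) x$, where $g'$ is the
derivative of $g$. Using the smoothness of $g$ and the Cauchy-Schwarz inequality we
therefore obtain

\begin{align*}
f(v) &= g(\langle v, x \rangle + b) \\
&\le g(\langle w, x \rangle + b) + g'(\langle w, x \rangle + b) \langle v - w, x \rangle + \frac{\beta}{2} (\langle v - w, x \rangle)^2 \\
&\le g(\langle w, x \rangle + b) + g'(\langle w, x \rangle + b) \langle v - w, x \rangle + \frac{\beta}{2} (\|v - w\| \, \|x\|)^2 \\
&= f(w) + \langle \nabla f(w), v - w \rangle + \frac{\beta \|x\|^2}{2} \|v - w\|^2 .
\end{align*}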


Example 12.6.

■ For any $x \in \mathbb{R}^d$ and $y \in \mathbb{R}$, let $f(w) = (\langle w, x \rangle - y)^2$. Then, $f$ is $(2\|x\|^2)$-smooth.
■ For any $x \in \mathbb{R}^d$ and $y \in \{\pm 1\}$, let $f(w) = \log(1 + \exp(-y \langle w, x \rangle))$. Then, $f$ is
$(\|x\|^2/4)$-smooth.


12.2 CONVEX LEARNING PROBLEMS 


Recall that in our general definition of learning (Definition 3.4 in Chapter 3), we
have a hypothesis class $\mathcal{H}$, a set of examples $Z$, and a loss function $\ell : \mathcal{H} \times Z \to \mathbb{R}_+$.
So far in the book we have mainly thought of $Z$ as being the product of an instance
space and a target space, $Z = \mathcal{X} \times \mathcal{Y}$, and $\mathcal{H}$ being a set of functions from $\mathcal{X}$ to
$\mathcal{Y}$. However, $\mathcal{H}$ can be an arbitrary set. Indeed, throughout this chapter, we consider
hypothesis classes $\mathcal{H}$ that are subsets of the Euclidean space $\mathbb{R}^d$. That is, every
hypothesis is some real-valued vector. We shall, therefore, denote a hypothesis in $\mathcal{H}$
by $w$. Now we can finally define convex learning problems:


Definition 12.10 (Convex Learning Problem). A learning problem, $(\mathcal{H}, Z, \ell)$, is
called convex if the hypothesis class $\mathcal{H}$ is a convex set and for all $z \in Z$, the loss
function, $\ell(\cdot, z)$, is a convex function (where, for any $z$, $\ell(\cdot, z)$ denotes the function
$f : \mathcal{H} \to \mathbb{R}$ defined by $f(w) = \ell(w, z)$).


Example 12.7 (Linear Regression with the Squared Loss). Recall that linear regression
is a tool for modeling the relationship between some “explanatory” variables
and some real valued outcome (see Chapter 9). The domain set $\mathcal{X}$ is a subset of
$\mathbb{R}^d$, for some $d$, and the label set $\mathcal{Y}$ is the set of real numbers. We would like to
learn a linear function $h : \mathbb{R}^d \to \mathbb{R}$ that best approximates the relationship between
our variables. In Chapter 9 we defined the hypothesis class as the set of homogeneous
linear functions, $\mathcal{H} = \{x \mapsto \langle w, x \rangle : w \in \mathbb{R}^d\}$, and used the squared loss function,
$\ell(h, (x, y)) = (h(x) - y)^2$. However, we can equivalently model the learning problem
as a convex learning problem as follows. Each linear function is parameterized by a
vector $w \in \mathbb{R}^d$. Hence, we can define $\mathcal{H}$ to be the set of all such parameters, namely,
$\mathcal{H} = \mathbb{R}^d$. The set of examples is $Z = \mathcal{X} \times \mathcal{Y} = \mathbb{R}^d \times \mathbb{R} = \mathbb{R}^{d+1}$, and the loss function
is $\ell(w, (x, y)) = (\langle w, x \rangle - y)^2$. Clearly, the set $\mathcal{H}$ is a convex set. The loss function is
also convex with respect to its first argument (see Example 12.2).


Lemma 12.11. If $\ell$ is a convex loss function and the class $\mathcal{H}$ is convex, then the
$\mathrm{ERM}_{\mathcal{H}}$ problem, of minimizing the empirical loss over $\mathcal{H}$, is a convex optimization
problem (that is, a problem of minimizing a convex function over a convex set).


Proof. Recall that the $\mathrm{ERM}_{\mathcal{H}}$ problem is defined by

\[ \mathrm{ERM}_{\mathcal{H}}(S) = \operatorname*{argmin}_{w \in \mathcal{H}} L_S(w) . \]

Since, for a sample $S = z_1, \ldots, z_m$, for every $w$, $L_S(w) = \frac{1}{m} \sum_{i=1}^{m} \ell(w, z_i)$, Claim 12.5
implies that $L_S(w)$ is a convex function. Therefore, the ERM rule is a problem of
minimizing a convex function subject to the constraint that the solution should be in
a convex set. □
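
As an illustration of Lemma 12.11 for the squared loss of Example 12.7, the sketch below builds the empirical risk as an average of per-example losses and checks the convexity inequality along one segment. The sample and the two endpoints are hypothetical, and a grid check of this kind is a sanity test rather than a proof.

```python
def squared_loss(w, z):
    """l(w, (x, y)) = (<w, x> - y)^2, with w and x given as lists of floats."""
    x, y = z
    return (sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2

def empirical_risk(w, sample, loss=squared_loss):
    """L_S(w) = (1/m) * sum_i loss(w, z_i)."""
    return sum(loss(w, z) for z in sample) / len(sample)

def convex_along_segment(risk, w1, w2, steps=20, tol=1e-9):
    """Check risk(a*w1 + (1-a)*w2) <= a*risk(w1) + (1-a)*risk(w2) on a grid of a."""
    for k in range(steps + 1):
        a = k / steps
        mix = [a * u + (1 - a) * v for u, v in zip(w1, w2)]
        if risk(mix) > a * risk(w1) + (1 - a) * risk(w2) + tol:
            return False
    return True

# Hypothetical sample in R^2 x R.
S = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 0.5)]
risk = lambda w: empirical_risk(w, S)
print("L_S([1, -1]) =", risk([1.0, -1.0]))
print("convex along segment:", convex_along_segment(risk, [3.0, -2.0], [-1.0, 4.0]))
```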


Under mild conditions, such problems can be solved efficiently using generic 
optimization algorithms. In particular, in Chapter 14 we will present a very simple 
algorithm for minimizing convex functions. 


12.2.1 Learnability of Convex Learning Problems 


We have argued that for many cases, implementing the ERM rule for convex learn- 
ing problems can be done efficiently. But is convexity a sufficient condition for the 
learnability of a problem? 

To make the question more specific: In VC theory, we saw that halfspaces in $d$
dimensions are learnable (perhaps inefficiently). We also argued in Chapter 9 using
the “discretization trick” that if the problem is of $d$ parameters, it is learnable with
a sample complexity being a function of $d$. That is, for a constant $d$, the problem
should be learnable. So, maybe all convex learning problems over $\mathbb{R}^d$ are learnable?

Example 12.8 later shows that the answer is negative, even when $d$ is low. Not
all convex learning problems over $\mathbb{R}^d$ are learnable. There is no contradiction to
VC theory since VC theory only deals with binary classification while here we con- 
sider a wide family of problems. There is also no contradiction to the “discretization 
trick” as there we assumed that the loss function is bounded and also assumed that 
a representation of each parameter using a finite number of bits suffices. As we will 
show later, under some additional restricting conditions that hold in many practical 
scenarios, convex problems are learnable. 


Example 12.8 (Nonlearnability of Linear Regression Even If d = 1). Let $\mathcal{H} = \mathbb{R}$, and
the loss be the squared loss: $\ell(w, (x, y)) = (wx - y)^2$ (we’re referring to the homogenous
case). Let A be any deterministic algorithm.¹ Assume, by way of contradiction,
that A is a successful PAC learner for this problem. That is, there exists a function
$m(\cdot, \cdot)$, such that for every distribution $\mathcal{D}$ and for every $\epsilon, \delta$ if A receives a training set
of size $m \ge m(\epsilon, \delta)$, it should output, with probability of at least $1 - \delta$, a hypothesis
$\hat{w} = A(S)$, such that $L_{\mathcal{D}}(\hat{w}) - \min_w L_{\mathcal{D}}(w) \le \epsilon$.

Choose $\epsilon = 1/100$, $\delta = 1/2$, let $m \ge m(\epsilon, \delta)$, and set $\mu = \frac{\log(100/99)}{2m}$. We will define
two distributions, and will show that A is likely to fail on at least one of them. The
first distribution, $\mathcal{D}_1$, is supported on two examples, $z_1 = (1, 0)$ and $z_2 = (\mu, -1)$,
where the probability mass of the first example is $\mu$ while the probability mass of the
second example is $1 - \mu$. The second distribution, $\mathcal{D}_2$, is supported entirely on $z_2$.

Observe that for both distributions, the probability that all examples of the training
set will be of the second type is at least 99%. This is trivially true for $\mathcal{D}_2$, whereas
for $\mathcal{D}_1$, the probability of this event is

$$(1 - \mu)^m \ge e^{-2\mu m} = 0.99.$$


Since we assume that A is a deterministic algorithm, upon receiving a training
set of m examples, each of which is $(\mu, -1)$, the algorithm will output some $\hat{w}$. Now,
if $\hat{w} < -1/(2\mu)$, we will set the distribution to be $\mathcal{D}_1$. Hence,

$$L_{\mathcal{D}_1}(\hat{w}) \ge \mu (\hat{w})^2 \ge 1/(4\mu).$$

However,

$$\min_w L_{\mathcal{D}_1}(w) \le L_{\mathcal{D}_1}(0) = (1 - \mu).$$

It follows that

$$L_{\mathcal{D}_1}(\hat{w}) - \min_w L_{\mathcal{D}_1}(w) \ge \frac{1}{4\mu} - (1 - \mu) > \epsilon.$$

Therefore, such an algorithm A fails on $\mathcal{D}_1$. On the other hand, if $\hat{w} \ge -1/(2\mu)$
then we’ll set the distribution to be $\mathcal{D}_2$. Then we have that $L_{\mathcal{D}_2}(\hat{w}) \ge 1/4$ while
$\min_w L_{\mathcal{D}_2}(w) = 0$, so A fails on $\mathcal{D}_2$. In summary, we have shown that for every A
there exists a distribution on which A fails, which implies that the problem is not
PAC learnable.
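
The construction in Example 12.8 can be traced numerically. The sketch below is an illustration only: the function name `bad_distribution_excess_risk` and the closed-form minimizer of the risk on the first distribution are our own additions. It takes the value an algorithm outputs after seeing m copies of (μ, −1), picks the distribution as in the example, and reports the excess risk on it.

```python
import math

def bad_distribution_excess_risk(w_hat, m):
    """Return (distribution name, excess risk of w_hat, P[all samples equal z2 under D1])."""
    mu = math.log(100 / 99) / (2 * m)
    p_all_second = (1 - mu) ** m          # probability of seeing only z2 under D1

    def risk_d1(w):
        # D1 puts mass mu on z1 = (1, 0) and mass 1 - mu on z2 = (mu, -1).
        return mu * w ** 2 + (1 - mu) * (mu * w + 1) ** 2

    def risk_d2(w):
        # D2 is supported entirely on z2 = (mu, -1).
        return (mu * w + 1) ** 2

    if w_hat < -1 / (2 * mu):
        # Exact minimizer of the quadratic risk_d1 (setting its derivative to zero).
        w_star = -(1 - mu) / (1 + mu * (1 - mu))
        return "D1", risk_d1(w_hat) - risk_d1(w_star), p_all_second
    # The minimum of risk_d2 is 0, attained at w = -1/mu.
    return "D2", risk_d2(w_hat), p_all_second

for w_hat in (-1000.0, 0.0):
    print(bad_distribution_excess_risk(w_hat, m=10))
```

Whatever value `w_hat` takes, the reported excess risk stays above ε = 1/100, while the all-`z2` sample still occurs with probability at least 0.99 under the first distribution.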


¹ Namely, given S the output of A is determined. This requirement is for the sake of simplicity. A slightly
more involved argument will show that nondeterministic algorithms will also fail to learn the problem.

A possible solution to this problem is to add another constraint on the hypothesis
class. In addition to the convexity requirement, we require that $\mathcal{H}$ will be bounded;
namely, we assume that for some predefined scalar B, every hypothesis $\mathbf{w} \in \mathcal{H}$
satisfies $\|\mathbf{w}\| \le B$.

Boundedness and convexity alone are still not sufficient for ensuring that the 
problem is learnable, as the following example demonstrates. 


Example 12.9. As in Example 12.8, consider a regression problem with the squared
loss. However, this time let $\mathcal{H} = \{w : |w| \le 1\} \subset \mathbb{R}$ be a bounded hypothesis class. It is
easy to verify that $\mathcal{H}$ is convex. The argument will be the same as in Example 12.8,
except that now the two distributions, $\mathcal{D}_1, \mathcal{D}_2$, will be supported on $z_1 = (1/\mu, 0)$ and
$z_2 = (1, -1)$. If the algorithm A returns $\hat{w} < -1/2$ upon receiving m examples of the
second type, then we will set the distribution to be $\mathcal{D}_1$ and have that

$$L_{\mathcal{D}_1}(\hat{w}) - \min_w L_{\mathcal{D}_1}(w) \ge \mu (\hat{w}/\mu)^2 - L_{\mathcal{D}_1}(0) \ge 1/(4\mu) - (1 - \mu) > \epsilon.$$

Similarly, if $\hat{w} \ge -1/2$ we will set the distribution to be $\mathcal{D}_2$ and have that

$$L_{\mathcal{D}_2}(\hat{w}) - \min_w L_{\mathcal{D}_2}(w) \ge (-1/2 + 1)^2 - 0 > \epsilon.$$


This example shows that we need additional assumptions on the learning prob- 
lem, and this time the solution is in Lipschitzness or smoothness of the loss function. 
This motivates a definition of two families of learning problems, convex-Lipschitz- 
bounded and convex-smooth-bounded, which are defined later. 


12.2.2 Convex-Lipschitz/Smooth-Bounded Learning Problems


Definition 12.12 (Convex-Lipschitz-Bounded Learning Problem). A learning problem,
$(\mathcal{H}, Z, \ell)$, is called Convex-Lipschitz-Bounded, with parameters $\rho, B$ if the
following holds:

■ The hypothesis class $\mathcal{H}$ is a convex set and for all $\mathbf{w} \in \mathcal{H}$ we have $\|\mathbf{w}\| \le B$.
■ For all $z \in Z$, the loss function, $\ell(\cdot, z)$, is a convex and $\rho$-Lipschitz function.


Example 12.10. Let $\mathcal{X} = \{\mathbf{x} \in \mathbb{R}^d : \|\mathbf{x}\| \le \rho\}$ and $\mathcal{Y} = \mathbb{R}$. Let $\mathcal{H} = \{\mathbf{w} \in \mathbb{R}^d : \|\mathbf{w}\| \le B\}$
and let the loss function be $\ell(\mathbf{w}, (\mathbf{x}, y)) = |\langle \mathbf{w}, \mathbf{x} \rangle - y|$. This corresponds to a regression
problem with the absolute-value loss, where we assume that the instances are in a
ball of radius $\rho$ and we restrict the hypotheses to be homogenous linear functions
defined by a vector $\mathbf{w}$ whose norm is bounded by B. Then, the resulting problem is
Convex-Lipschitz-Bounded with parameters $\rho, B$.
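
The Lipschitz claim in this example can be verified in two lines, using the reverse triangle inequality and then Cauchy-Schwarz. This is a short check added for convenience and is not part of the original example. Convexity holds because the loss is the absolute value of a function that is affine in w.

```latex
\begin{aligned}
\bigl|\ell(\mathbf{w},(\mathbf{x},y)) - \ell(\mathbf{u},(\mathbf{x},y))\bigr|
  &= \bigl|\,|\langle \mathbf{w},\mathbf{x}\rangle - y| - |\langle \mathbf{u},\mathbf{x}\rangle - y|\,\bigr| \\
  &\le |\langle \mathbf{w}-\mathbf{u},\mathbf{x}\rangle| \\
  &\le \|\mathbf{x}\|\,\|\mathbf{w}-\mathbf{u}\|
   \;\le\; \rho\,\|\mathbf{w}-\mathbf{u}\| .
\end{aligned}
```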


Definition 12.13 (Convex-Smooth-Bounded Learning Problem). A learning problem,
$(\mathcal{H}, Z, \ell)$, is called Convex-Smooth-Bounded, with parameters $\beta, B$ if the
following holds:

■ The hypothesis class $\mathcal{H}$ is a convex set and for all $\mathbf{w} \in \mathcal{H}$ we have $\|\mathbf{w}\| \le B$.
■ For all $z \in Z$, the loss function, $\ell(\cdot, z)$, is a convex, nonnegative, and $\beta$-smooth
function.

[Figure: the hinge loss, plotted as a function of $y\langle \mathbf{w}, \mathbf{x} \rangle$.]


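Once we have defined the surrogate convex loss, we can learn the problem with
respect to it. The generalization requirement from a hinge loss learner will have the
form

$$L_{\mathcal{D}}^{\mathrm{hinge}}(A(S)) \le \min_{\mathbf{w} \in \mathcal{H}} L_{\mathcal{D}}^{\mathrm{hinge}}(\mathbf{w}) + \epsilon,$$

where $L_{\mathcal{D}}^{\mathrm{hinge}}(\mathbf{w}) = \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}[\ell^{\mathrm{hinge}}(\mathbf{w}, (\mathbf{x}, y))]$. Using the surrogate property, we can
lower bound the left-hand side by $L_{\mathcal{D}}^{0-1}(A(S))$, which yields

$$L_{\mathcal{D}}^{0-1}(A(S)) \le \min_{\mathbf{w} \in \mathcal{H}} L_{\mathcal{D}}^{\mathrm{hinge}}(\mathbf{w}) + \epsilon.$$

We can further rewrite the upper bound as follows:

$$L_{\mathcal{D}}^{0-1}(A(S)) \le \min_{\mathbf{w} \in \mathcal{H}} L_{\mathcal{D}}^{0-1}(\mathbf{w}) + \left( \min_{\mathbf{w} \in \mathcal{H}} L_{\mathcal{D}}^{\mathrm{hinge}}(\mathbf{w}) - \min_{\mathbf{w} \in \mathcal{H}} L_{\mathcal{D}}^{0-1}(\mathbf{w}) \right) + \epsilon.$$

That is, the 0-1 error of the learned predictor is upper bounded by three terms: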


■ Approximation error: This is the term $\min_{\mathbf{w} \in \mathcal{H}} L_{\mathcal{D}}^{0-1}(\mathbf{w})$, which measures how
well the hypothesis class performs on the distribution. We already elaborated
on this error term in Chapter 5.

■ Estimation error: This is the error that results from the fact that we only receive
a training set and do not observe the distribution $\mathcal{D}$. We already elaborated on
this error term in Chapter 5.

■ Optimization error: This is the term $\left( \min_{\mathbf{w} \in \mathcal{H}} L_{\mathcal{D}}^{\mathrm{hinge}}(\mathbf{w}) - \min_{\mathbf{w} \in \mathcal{H}} L_{\mathcal{D}}^{0-1}(\mathbf{w}) \right)$ that
measures the difference between the approximation error with respect to the
surrogate loss and the approximation error with respect to the original loss.
The optimization error is a result of our inability to minimize the training loss
with respect to the original loss. The size of this error depends on the specific
distribution of the data and on the specific surrogate loss we are using.
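
The surrogate property used above (the hinge loss upper bounds the 0-1 loss pointwise) is easy to check empirically. The sketch below assumes the usual definitions, with the 0-1 loss counting an error whenever y⟨w, x⟩ ≤ 0 and the hinge loss equal to max{0, 1 − y⟨w, x⟩}. The synthetic data and the function names are illustrative assumptions.

```python
import numpy as np

def zero_one_risk(w, X, y):
    """Average 0-1 loss: fraction of examples with y * <w, x> <= 0."""
    return float(np.mean(y * (X @ w) <= 0))

def hinge_risk(w, X, y):
    """Average hinge loss max{0, 1 - y * <w, x>}."""
    return float(np.mean(np.maximum(0.0, 1.0 - y * (X @ w))))

# Synthetic, noisy, roughly linearly separable data (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=200) > 0, 1.0, -1.0)

# The hinge risk should dominate the 0-1 risk for every candidate hypothesis.
candidates = [rng.normal(size=2) for _ in range(500)]
surrogate_holds = all(hinge_risk(w, X, y) >= zero_one_risk(w, X, y)
                      for w in candidates)
```

Because the inequality holds example by example, it holds for any average of examples, which is what allows the left-hand side of the hinge-loss guarantee to be lower bounded by the 0-1 risk.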


12.4 SUMMARY 


We introduced two families of learning problems: convex-Lipschitz-bounded and 
convex-smooth-bounded. In the next two chapters we will describe two generic
learning algorithms for these families. We also introduced the notion of convex
surrogate loss function, which enables us to utilize the convex machinery for
nonconvex problems as well.


12.5 BIBLIOGRAPHIC REMARKS 


There are several excellent books on convex analysis and optimization (Boyd & 
Vandenberghe 2004, Borwein & Lewis 2006, Bertsekas 1999, Hiriart-Urruty & 
Lemaréchal 1993). Regarding learning problems, the family of convex-Lipschitz- 
bounded problems was first studied by Zinkevich (2003) in the context of online 
learning and by Shalev-Shwartz, Shamir, Sridharan, and Srebro (2009) in the
context of PAC learning. 


12.6 EXERCISES 


12.1 Construct an example showing that the 0-1 loss function may suffer from local
minima; namely, construct a training sample $S \in (\mathcal{X} \times \{\pm 1\})^m$ (say, for $\mathcal{X} = \mathbb{R}^2$), for
which there exist a vector $\mathbf{w}$ and some $\epsilon > 0$ such that

1. For any $\mathbf{w}'$ such that $\|\mathbf{w} - \mathbf{w}'\| \le \epsilon$ we have $L_S(\mathbf{w}) \le L_S(\mathbf{w}')$ (where the loss here
is the 0-1 loss). This means that $\mathbf{w}$ is a local minimum of $L_S$.

2. There exists some $\mathbf{w}^*$ such that $L_S(\mathbf{w}^*) < L_S(\mathbf{w})$. This means that $\mathbf{w}$ is not a
global minimum of $L_S$.

12.2 Consider the learning problem of logistic regression: Let $\mathcal{H} = \mathcal{X} = \{\mathbf{x} \in \mathbb{R}^d : \|\mathbf{x}\| \le B\}$,
for some scalar $B > 0$, let $\mathcal{Y} = \{\pm 1\}$, and let the loss function $\ell$ be defined as
$\ell(\mathbf{w}, (\mathbf{x}, y)) = \log(1 + \exp(-y \langle \mathbf{w}, \mathbf{x} \rangle))$. Show that the resulting learning problem
is both convex-Lipschitz-bounded and convex-smooth-bounded. Specify the
parameters of Lipschitzness and smoothness.

12.3 Consider the problem of learning halfspaces with the hinge loss. We limit our
domain to the Euclidean ball with radius R. That is, $\mathcal{X} = \{\mathbf{x} : \|\mathbf{x}\|_2 \le R\}$. The label set
is $\mathcal{Y} = \{\pm 1\}$ and the loss function $\ell$ is defined by $\ell(\mathbf{w}, (\mathbf{x}, y)) = \max\{0, 1 - y \langle \mathbf{w}, \mathbf{x} \rangle\}$.
We already know that the loss function is convex. Show that it is R-Lipschitz.

12.4 (*) Convex-Lipschitz-Boundedness Is Not Sufficient for Computational Efficiency: 
In the next chapter we show that from the statistical perspective, all convex- 
Lipschitz-bounded problems are learnable (in the agnostic PAC model). However, 
our main motivation to learn such problems resulted from the computational per- 
spective — convex optimization is often efficiently solvable. Yet the goal of this 
exercise is to show that convexity alone is not sufficient for efficiency. We show 
that even for the case d = 1, there is a convex-Lipschitz-bounded problem which 
cannot be learned by any computable learner. 

Let the hypothesis class be $\mathcal{H} = [0, 1]$ and let the example domain, Z, be the set of
all Turing machines. Define the loss function as follows. For every Turing machine
$T \in Z$, let $\ell(0, T) = 1$ if T halts on the input 0 and $\ell(0, T) = 0$ if T doesn’t halt on the
input 0. Similarly, let $\ell(1, T) = 0$ if T halts on the input 0 and $\ell(1, T) = 1$ if T doesn’t
halt on the input 0. Finally, for $h \in (0, 1)$, let $\ell(h, T) = h\,\ell(0, T) + (1 - h)\,\ell(1, T)$.

1. Show that the resulting learning problem is convex-Lipschitz-bounded.
2. Show that no computable algorithm can learn the problem.


In the previous chapter we introduced the families of convex-Lipschitz-bounded
and convex-smooth-bounded learning problems. In this section we show that all
learning problems in these two families are learnable. For some learning problems
of this type it is possible to show that uniform convergence holds; hence they are 
learnable using the ERM rule. However, this is not true for all learning problems of 
this type. Yet, we will introduce another learning rule and will show that it learns all 
convex-Lipschitz-bounded and convex-smooth-bounded learning problems. 

The new learning paradigm we introduce in this chapter is called Regularized 
Loss Minimization, or RLM for short. In RLM we minimize the sum of the empirical 
risk and a regularization function. Intuitively, the regularization function measures 
the complexity of hypotheses. Indeed, one interpretation of the regularization func- 
tion is the structural risk minimization paradigm we discussed in Chapter 7. Another 
view of regularization is as a stabilizer of the learning algorithm. An algorithm is 
considered stable if a slight change of its input does not change its output much. We
will formally define the notion of stability (what we mean by “slight change of input”
and by “does not change the output much”) and prove its close relation to learnability.
Finally, we will show that using the squared $\ell_2$ norm as a regularization function
stabilizes all convex-Lipschitz or convex-smooth learning problems. Hence, RLM
can be used as a general learning rule for these families of learning problems.


13.1 REGULARIZED LOSS MINIMIZATION 


Regularized Loss Minimization (RLM) is a learning rule in which we jointly minimize
the empirical risk and a regularization function. Formally, a regularization
function is a mapping $R : \mathbb{R}^d \to \mathbb{R}$, and the regularized loss minimization rule outputs
a hypothesis in

$$\operatorname*{argmin}_{\mathbf{w}} \left( L_S(\mathbf{w}) + R(\mathbf{w}) \right).$$ (13.1)


Regularized loss minimization shares similarities with minimum description length
algorithms and structural risk minimization (see Chapter 7). Intuitively, the “complexity”
of hypotheses is measured by the value of the regularization function, and
the algorithm balances between low empirical risk and “simpler,” or “less complex,”
hypotheses.

There are many possible regularization functions one can use, reflecting some
prior belief about the problem (similarly to the description language in Minimum
Description Length). Throughout this section we will focus on one of the most simple
regularization functions: $R(\mathbf{w}) = \lambda \|\mathbf{w}\|^2$, where $\lambda > 0$ is a scalar and the norm is
the $\ell_2$ norm, $\|\mathbf{w}\| = \sqrt{\sum_{i=1}^{d} w_i^2}$. This yields the learning rule:

$$A(S) = \operatorname*{argmin}_{\mathbf{w}} \left( L_S(\mathbf{w}) + \lambda \|\mathbf{w}\|^2 \right).$$ (13.2)


This type of regularization function is often called Tikhonov regularization. 

As mentioned before, one interpretation of Equation (13.2) is using structural 
risk minimization, where the norm of w is a measure of its “complexity.” Recall that 
in the previous chapter we introduced the notion of bounded hypothesis classes. 
Therefore, we can define a sequence of hypothesis classes, $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \mathcal{H}_3 \ldots$, where
$\mathcal{H}_i = \{\mathbf{w} : \|\mathbf{w}\|_2 \le i\}$. If the sample complexity of each $\mathcal{H}_i$ depends on i then the RLM
rule is similar to the SRM rule for this sequence of nested classes.

A different interpretation of regularization is as a stabilizer. In the next section 
we define the notion of stability and prove that stable learning rules do not overfit. 
But first, let us demonstrate the RLM rule for linear regression with the squared 
loss. 


13.1.1 Ridge Regression


Applying the RLM rule with Tikhonov regularization to linear regression with the 
squared loss, we obtain the following learning rule: 


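$$\operatorname*{argmin}_{\mathbf{w} \in \mathbb{R}^d} \left( \lambda \|\mathbf{w}\|_2^2 + \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left( \langle \mathbf{w}, \mathbf{x}_i \rangle - y_i \right)^2 \right).$$ (13.3)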


Performing linear regression using Equation (13.3) is called ridge regression.
To solve Equation (13.3) we compare the gradient of the objective to zero and
obtain the set of linear equations

$$(2\lambda m I + A)\mathbf{w} = \mathbf{b},$$

where I is the identity matrix and A, b are as defined in Equation (9.6), namely,

$$A = \left( \sum_{i=1}^{m} \mathbf{x}_i \mathbf{x}_i^\top \right) \quad \text{and} \quad \mathbf{b} = \sum_{i=1}^{m} y_i \mathbf{x}_i.$$ (13.4)

Since A is a positive semidefinite matrix, the matrix $2\lambda m I + A$ has all its eigenvalues
bounded below by $2\lambda m$. Hence, this matrix is invertible and the solution to ridge
regression becomes

$$\mathbf{w} = (2\lambda m I + A)^{-1} \mathbf{b}.$$ (13.5)
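
The closed-form solution translates directly into a few lines of NumPy. The following is a minimal sketch under the assumption that the objective is λ‖w‖² plus the average of half the squared residuals, as in the derivation above. The function names and the synthetic data are illustrative.

```python
import numpy as np

def ridge_regression(X, y, lam):
    """Solve (2*lam*m*I + A) w = b with A = sum_i x_i x_i^T and b = sum_i y_i x_i.

    X has one example per row (shape m x d) and y has shape (m,).
    """
    m, d = X.shape
    A = X.T @ X
    b = X.T @ y
    # Solving the linear system is preferable to forming the inverse explicitly.
    return np.linalg.solve(2.0 * lam * m * np.eye(d) + A, b)

def ridge_objective(w, X, y, lam):
    """lam * ||w||^2 + (1/m) * sum_i 0.5 * (<w, x_i> - y_i)^2."""
    return lam * float(w @ w) + float(np.mean(0.5 * (X @ w - y) ** 2))

# Synthetic regression data (an assumption made only for this illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=30)

w_ridge = ridge_regression(X, y, lam=0.1)
# Gradient of the objective at the solution; it should vanish up to rounding.
gradient = 2.0 * 0.1 * w_ridge + X.T @ (X @ w_ridge - y) / X.shape[0]
```

Since the matrix being inverted has eigenvalues of at least 2λm, the linear system is well conditioned whenever λ is not too small, which is one practical face of the stabilizing effect of regularization.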


In the next section we formally show how regularization stabilizes the algorithm 
and prevents overfitting. In particular, the analysis presented in the next sections 
(particularly, Corollary 13.11) will yield:


On the other hand, we can write

$$\mathbb{E}_S[L_S(A(S))] = \mathbb{E}_{S, i}[\ell(A(S), z_i)].$$

Combining the two equations we conclude our proof. □


When the right-hand side of Equation (13.6) is small, we say that A is a sta- 
ble algorithm — changing a single example in the training set does not lead to a 
significant change. Formally, 


Definition 13.3 (On-Average-Replace-One-Stable). Let $\epsilon : \mathbb{N} \to \mathbb{R}$ be a monotonically
decreasing function. We say that a learning algorithm A is on-average-replace-one-stable
with rate $\epsilon(m)$ if for every distribution $\mathcal{D}$

$$\mathbb{E}_{(S, z') \sim \mathcal{D}^{m+1},\; i \sim U(m)} \left[ \ell(A(S^{(i)}), z_i) - \ell(A(S), z_i) \right] \le \epsilon(m).$$


Theorem 13.2 tells us that a learning algorithm does not overfit if and only if 
it is on-average-replace-one-stable. Of course, a learning algorithm that does not 
overfit is not necessarily a good learning algorithm — take, for example, an algo- 
rithm A that always outputs the same hypothesis. A useful algorithm should find 
a hypothesis that on one hand fits the training set (i.e., has a low empirical risk) 
and on the other hand does not overfit. Or, in light of Theorem 13.2, the algorithm 
should both fit the training set and at the same time be stable. As we shall see, the
parameter $\lambda$ of the RLM rule balances between fitting the training set and being
stable.
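
The replace-one quantity in Definition 13.3 can be estimated by simulation. The sketch below uses ridge regression as the learning rule and synthetic Gaussian data, both chosen only for illustration. It also draws a separate fresh example for every index i, rather than a single z', purely to reduce the variance of the estimate.

```python
import numpy as np

def rlm_ridge(X, y, lam):
    """RLM with Tikhonov regularization for the loss 0.5 * (<w, x> - y)^2."""
    m, d = X.shape
    return np.linalg.solve(2.0 * lam * m * np.eye(d) + X.T @ X, X.T @ y)

def half_squared_loss(w, x, y):
    return 0.5 * (float(w @ x) - y) ** 2

def replace_one_gap(X, y, X_new, y_new, lam):
    """Average over i of loss(A(S^(i)), z_i) - loss(A(S), z_i)."""
    w_full = rlm_ridge(X, y, lam)
    gaps = []
    for i in range(X.shape[0]):
        Xi, yi = X.copy(), y.copy()
        Xi[i], yi[i] = X_new[i], y_new[i]      # S^(i): z_i replaced by a fresh example
        w_rep = rlm_ridge(Xi, yi, lam)
        gaps.append(half_squared_loss(w_rep, X[i], y[i])
                    - half_squared_loss(w_full, X[i], y[i]))
    return float(np.mean(gaps))

rng = np.random.default_rng(0)
m, d = 20, 15
w_true = rng.normal(size=d)

def sample(n):
    X = rng.normal(size=(n, d))
    return X, X @ w_true + rng.normal(size=n)

X, y = sample(m)
X_new, y_new = sample(m)
gap_weak = replace_one_gap(X, y, X_new, y_new, lam=1e-6)    # almost no regularization
gap_strong = replace_one_gap(X, y, X_new, y_new, lam=1e3)   # heavy regularization
```

In this setup the weakly regularized rule fits the sample closely and is sensitive to replacing a single example, while the heavily regularized rule barely moves, at the price of a much worse fit to the training set.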


13.3 TIKHONOV REGULARIZATION AS A STABILIZER 


In the previous section we saw that stable rules do not overfit. In this section we 
show that applying the RLM rule with Tikhonov regularization, $\lambda \|\mathbf{w}\|^2$, leads to a
stable algorithm. We will assume that the loss function is convex and that it is either 
Lipschitz or smooth. 

The main property of the Tikhonov regularization that we rely on is that it makes 
the objective of RLM strongly convex, as defined in the following. 


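Definition 13.4 (Strongly Convex Functions). A function f is $\lambda$-strongly convex if
for all $\mathbf{w}$, $\mathbf{u}$, and $\alpha \in (0, 1)$ we have

$$f(\alpha \mathbf{w} + (1 - \alpha) \mathbf{u}) \le \alpha f(\mathbf{w}) + (1 - \alpha) f(\mathbf{u}) - \frac{\lambda}{2} \alpha (1 - \alpha) \|\mathbf{w} - \mathbf{u}\|^2.$$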


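Clearly, every convex function is 0-strongly convex. An illustration of strong
convexity is given in the following figure.

[Figure: illustration of strong convexity, with the points $\mathbf{w}$, $\mathbf{u}$, and $\alpha \mathbf{w} + (1 - \alpha) \mathbf{u}$ marked on the horizontal axis and a gap labeled with the term $\alpha (1 - \alpha) \|\mathbf{u} - \mathbf{w}\|^2$.]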

The following lemma implies that the objective of RLM is $(2\lambda)$-strongly convex.
In addition, it underscores an important property of strong convexity.


Lemma 13.5.

1. The function $f(\mathbf{w}) = \lambda \|\mathbf{w}\|^2$ is $2\lambda$-strongly convex.
2. If f is $\lambda$-strongly convex and g is convex, then $f + g$ is $\lambda$-strongly convex.
3. If f is $\lambda$-strongly convex and $\mathbf{u}$ is a minimizer of f, then, for any $\mathbf{w}$,

$$f(\mathbf{w}) - f(\mathbf{u}) \ge \frac{\lambda}{2} \|\mathbf{w} - \mathbf{u}\|^2.$$


Proof. The first two points follow directly from the definition. To prove the last
point, we divide the definition of strong convexity by $\alpha$ and rearrange terms to get
that

$$\frac{f(\mathbf{u} + \alpha(\mathbf{w} - \mathbf{u})) - f(\mathbf{u})}{\alpha} \le f(\mathbf{w}) - f(\mathbf{u}) - \frac{\lambda}{2} (1 - \alpha) \|\mathbf{w} - \mathbf{u}\|^2.$$

Taking the limit $\alpha \to 0$ we obtain that the right-hand side converges to $f(\mathbf{w}) - f(\mathbf{u}) -
\frac{\lambda}{2} \|\mathbf{w} - \mathbf{u}\|^2$. On the other hand, the left-hand side becomes the derivative of the
function $g(\alpha) = f(\mathbf{u} + \alpha(\mathbf{w} - \mathbf{u}))$ at $\alpha = 0$. Since $\mathbf{u}$ is a minimizer of f, it follows that
$\alpha = 0$ is a minimizer of g, and therefore the left-hand side of the preceding goes to
zero in the limit $\alpha \to 0$, which concludes our proof. □
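
For completeness, the first item of the lemma can be checked by expanding the squared norm. The identity below is a standard computation and is spelled out here only as a convenience for the reader.

```latex
\begin{aligned}
\|\alpha \mathbf{w} + (1-\alpha)\mathbf{u}\|^2
  &= \alpha^2 \|\mathbf{w}\|^2 + 2\alpha(1-\alpha)\langle \mathbf{w},\mathbf{u}\rangle + (1-\alpha)^2 \|\mathbf{u}\|^2 \\
  &= \alpha \|\mathbf{w}\|^2 + (1-\alpha)\|\mathbf{u}\|^2
     - \alpha(1-\alpha)\bigl(\|\mathbf{w}\|^2 - 2\langle \mathbf{w},\mathbf{u}\rangle + \|\mathbf{u}\|^2\bigr) \\
  &= \alpha \|\mathbf{w}\|^2 + (1-\alpha)\|\mathbf{u}\|^2 - \alpha(1-\alpha)\|\mathbf{w}-\mathbf{u}\|^2 .
\end{aligned}
```

Multiplying both sides by λ shows that f(w) = λ‖w‖² satisfies the inequality of Definition 13.4 with equality when the strong convexity parameter is 2λ, since (2λ)/2 = λ.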


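We now turn to prove that RLM is stable. Let $S = (z_1, \ldots, z_m)$ be a training set,
let $z'$ be an additional example, and let $S^{(i)} = (z_1, \ldots, z_{i-1}, z', z_{i+1}, \ldots, z_m)$. Let A be
the RLM rule, namely,

$$A(S) = \operatorname*{argmin}_{\mathbf{w}} \left( L_S(\mathbf{w}) + \lambda \|\mathbf{w}\|^2 \right).$$

Denote $f_S(\mathbf{w}) = L_S(\mathbf{w}) + \lambda \|\mathbf{w}\|^2$, and on the basis of Lemma 13.5 we know that $f_S$ is
$(2\lambda)$-strongly convex. Relying on part 3 of the lemma, it follows that for any $\mathbf{v}$,

$$f_S(\mathbf{v}) - f_S(A(S)) \ge \lambda \|\mathbf{v} - A(S)\|^2.$$ (13.7)

On the other hand, for any $\mathbf{v}$ and $\mathbf{u}$, and for all i, we have

$$\begin{aligned}
f_S(\mathbf{v}) - f_S(\mathbf{u}) &= L_S(\mathbf{v}) + \lambda \|\mathbf{v}\|^2 - \left( L_S(\mathbf{u}) + \lambda \|\mathbf{u}\|^2 \right) \\
&= L_{S^{(i)}}(\mathbf{v}) + \lambda \|\mathbf{v}\|^2 - \left( L_{S^{(i)}}(\mathbf{u}) + \lambda \|\mathbf{u}\|^2 \right) \\
&\quad + \frac{\ell(\mathbf{v}, z_i) - \ell(\mathbf{u}, z_i)}{m} + \frac{\ell(\mathbf{u}, z') - \ell(\mathbf{v}, z')}{m}.
\end{aligned}$$ (13.8)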
m m 


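In particular, choosing $\mathbf{v} = A(S^{(i)})$, $\mathbf{u} = A(S)$, and using the fact that $\mathbf{v}$ minimizes
$L_{S^{(i)}}(\mathbf{w}) + \lambda \|\mathbf{w}\|^2$, we obtain that

$$f_S(A(S^{(i)})) - f_S(A(S)) \le \frac{\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)}{m} + \frac{\ell(A(S), z') - \ell(A(S^{(i)}), z')}{m}.$$ (13.9)

Combining this with Equation (13.7) we obtain that

$$\lambda \|A(S^{(i)}) - A(S)\|^2 \le \frac{\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)}{m} + \frac{\ell(A(S), z') - \ell(A(S^{(i)}), z')}{m}.$$ (13.10)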


The two subsections that follow continue the stability analysis for either Lip- 
schitz or smooth loss functions. For both families of loss functions we show that 
RLM is stable and therefore it does not overfit. 


13.3.1 Lipschitz Loss 
If the loss function, $\ell(\cdot, z_i)$, is $\rho$-Lipschitz, then by the definition of Lipschitzness,

$$\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i) \le \rho \|A(S^{(i)}) - A(S)\|.$$ (13.11)

Similarly,

$$\ell(A(S), z') - \ell(A(S^{(i)}), z') \le \rho \|A(S^{(i)}) - A(S)\|.$$

Plugging these inequalities into Equation (13.10) we obtain

$$\lambda \|A(S^{(i)}) - A(S)\|^2 \le \frac{2\rho}{m} \|A(S^{(i)}) - A(S)\|,$$

which yields

$$\|A(S^{(i)}) - A(S)\| \le \frac{2\rho}{\lambda m}.$$

Plugging the preceding back into Equation (13.11) we conclude that

$$\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i) \le \frac{2\rho^2}{\lambda m}.$$


Since this holds for any S, $z'$, i we immediately obtain:


Corollary 13.6. Assume that the loss function is convex and $\rho$-Lipschitz. Then, the
RLM rule with the regularizer $\lambda \|\mathbf{w}\|^2$ is on-average-replace-one-stable with rate $\frac{2\rho^2}{\lambda m}$.
It follows (using Theorem 13.2) that

$$\mathbb{E}_{S \sim \mathcal{D}^m} \left[ L_{\mathcal{D}}(A(S)) - L_S(A(S)) \right] \le \frac{2\rho^2}{\lambda m}.$$


13.3.2 Smooth and Nonnegative Loss 


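If the loss is $\beta$-smooth and nonnegative then it is also self-bounded (see
Section 12.1):

$$\|\nabla f(\mathbf{w})\|^2 \le 2\beta f(\mathbf{w}).$$ (13.12)

We further assume that $\lambda \ge \frac{2\beta}{m}$ or, in other words, that $\beta \le \lambda m / 2$. By the smoothness
assumption we have that

$$\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i) \le \langle \nabla \ell(A(S), z_i), A(S^{(i)}) - A(S) \rangle + \frac{\beta}{2} \|A(S^{(i)}) - A(S)\|^2.$$ (13.13)

Using the Cauchy-Schwartz inequality and Equation (12.6) we further obtain that

$$\begin{aligned}
\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)
&\le \|\nabla \ell(A(S), z_i)\| \, \|A(S^{(i)}) - A(S)\| + \frac{\beta}{2} \|A(S^{(i)}) - A(S)\|^2 \\
&\le \sqrt{2\beta \ell(A(S), z_i)} \, \|A(S^{(i)}) - A(S)\| + \frac{\beta}{2} \|A(S^{(i)}) - A(S)\|^2.
\end{aligned}$$ (13.14)

By a symmetric argument it holds that

$$\ell(A(S), z') - \ell(A(S^{(i)}), z') \le \sqrt{2\beta \ell(A(S^{(i)}), z')} \, \|A(S^{(i)}) - A(S)\| + \frac{\beta}{2} \|A(S^{(i)}) - A(S)\|^2.$$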


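Plugging these inequalities into Equation (13.10) and rearranging terms we obtain
that

$$\|A(S^{(i)}) - A(S)\| \le \frac{\sqrt{2\beta}}{\lambda m - \beta} \left( \sqrt{\ell(A(S), z_i)} + \sqrt{\ell(A(S^{(i)}), z')} \right).$$

Combining the preceding with the assumption $\beta \le \lambda m / 2$ yields

$$\|A(S^{(i)}) - A(S)\| \le \frac{\sqrt{8\beta}}{\lambda m} \left( \sqrt{\ell(A(S), z_i)} + \sqrt{\ell(A(S^{(i)}), z')} \right).$$

Combining the preceding with Equation (13.14) and again using the assumption
$\beta \le \lambda m / 2$ yield

$$\begin{aligned}
\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)
&\le \sqrt{2\beta \ell(A(S), z_i)} \, \|A(S^{(i)}) - A(S)\| + \frac{\beta}{2} \|A(S^{(i)}) - A(S)\|^2 \\
&\le \left( \frac{4\beta}{\lambda m} + \frac{8\beta^2}{(\lambda m)^2} \right) \left( \sqrt{\ell(A(S), z_i)} + \sqrt{\ell(A(S^{(i)}), z')} \right)^2 \\
&\le \frac{8\beta}{\lambda m} \left( \sqrt{\ell(A(S), z_i)} + \sqrt{\ell(A(S^{(i)}), z')} \right)^2 \\
&\le \frac{24\beta}{\lambda m} \left( \ell(A(S), z_i) + \ell(A(S^{(i)}), z') \right),
\end{aligned}$$

where in the last step we used the inequality $(a + b)^2 \le 3(a^2 + b^2)$. Taking expectation
with respect to S, $z'$, i and noting that $\mathbb{E}[\ell(A(S), z_i)] = \mathbb{E}[\ell(A(S^{(i)}), z')] =
\mathbb{E}[L_S(A(S))]$, we conclude that: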


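Corollary 13.7. Assume that the loss function is $\beta$-smooth and nonnegative. Then, the
RLM rule with the regularizer $\lambda \|\mathbf{w}\|^2$, where $\lambda \ge \frac{2\beta}{m}$, satisfies

$$\mathbb{E} \left[ \ell(A(S^{(i)}), z_i) - \ell(A(S), z_i) \right] \le \frac{48\beta}{\lambda m} \, \mathbb{E}[L_S(A(S))].$$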


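Note that if for all z we have $\ell(\mathbf{0}, z) \le C$, for some scalar $C > 0$, then for every S,

$$L_S(A(S)) \le L_S(A(S)) + \lambda \|A(S)\|^2 \le L_S(\mathbf{0}) + \lambda \|\mathbf{0}\|^2 = L_S(\mathbf{0}) \le C.$$

Hence, Corollary 13.7 also implies that

$$\mathbb{E} \left[ \ell(A(S^{(i)}), z_i) - \ell(A(S), z_i) \right] \le \frac{48\beta C}{\lambda m}.$$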

13.4 CONTROLLING THE FITTING-STABILITY TRADEOFF 


We can rewrite the expected risk of a learning algorithm as 


$$\mathbb{E}_S[L_{\mathcal{D}}(A(S))] = \mathbb{E}_S[L_S(A(S))] + \mathbb{E}_S[L_{\mathcal{D}}(A(S)) - L_S(A(S))].$$ (13.15)


The first term reflects how well A(S) fits the training set while the second term 
reflects the difference between the true and empirical risks of A(S). As we have 
shown in Theorem 13.2, the second term is equivalent to the stability of A. Since 
our goal is to minimize the risk of the algorithm, we need that the sum of both terms 
will be small. 
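
The two terms of this decomposition can be observed side by side in a small simulation. The sketch below uses ridge regression on synthetic data and approximates the true risk by the average loss on a large fresh sample. The setup, the names, and the use of a fresh sample as a stand-in for the distribution are all assumptions made for illustration.

```python
import numpy as np

def rlm_ridge(X, y, lam):
    """RLM with Tikhonov regularization for the loss 0.5 * (<w, x> - y)^2."""
    m, d = X.shape
    return np.linalg.solve(2.0 * lam * m * np.eye(d) + X.T @ X, X.T @ y)

def risk(w, X, y):
    """Average of 0.5 * (<w, x> - y)^2 over the rows of X."""
    return float(np.mean(0.5 * (X @ w - y) ** 2))

rng = np.random.default_rng(1)
d = 10
w_true = rng.normal(size=d)

def sample(n):
    X = rng.normal(size=(n, d))
    return X, X @ w_true + rng.normal(size=n)

X_train, y_train = sample(30)
X_fresh, y_fresh = sample(20000)   # large fresh sample standing in for the distribution

rows = []
for lam in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:
    w = rlm_ridge(X_train, y_train, lam)
    train = risk(w, X_train, y_train)            # first term: fit to the training set
    gap = risk(w, X_fresh, y_fresh) - train      # second term: estimated true minus empirical risk
    rows.append((lam, train, gap))
    print(f"lambda={lam:g}  train={train:.3f}  gap={gap:.3f}")

train_risks = [row[1] for row in rows]
```

The training risk of an RLM solution can only grow as the regularization parameter grows, so the first column is nondecreasing. The estimated gap typically shrinks at the same time, which is the tradeoff that this section quantifies.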

In the previous section we have bounded the stability term. We have shown
that the stability term decreases as the regularization parameter, $\lambda$, increases. On
the other hand, the empirical risk increases with $\lambda$. We therefore face a tradeoff
between fitting and overfitting. This tradeoff is quite similar to the bias-complexity
tradeoff we discussed previously in the book.


following for all $\mathbf{w}^*$:

$$\mathbb{E}_S[L_{\mathcal{D}}(A(S))] \le \left( 1 + \frac{48\beta}{\lambda m} \right) \mathbb{E}_S[L_S(A(S))] \le \left( 1 + \frac{48\beta}{\lambda m} \right) \left( L_{\mathcal{D}}(\mathbf{w}^*) + \lambda \|\mathbf{w}^*\|^2 \right).$$


S 


For example, if we choose $\lambda = \frac{48\beta}{m}$ we obtain from the preceding that the
expected true risk of A(S) is at most twice the expected empirical risk of A(S).
Furthermore, for this value of $\lambda$, the expected empirical risk of A(S) is at most
$L_{\mathcal{D}}(\mathbf{w}^*) + \frac{48\beta}{m} \|\mathbf{w}^*\|^2$.

We can also derive a learnability guarantee for convex-smooth-bounded learn- 
ing problems based on Corollary 13.10. 


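Corollary 13.11. Let $(\mathcal{H}, Z, \ell)$ be a convex-smooth-bounded learning problem with
parameters $\beta, B$. Assume in addition that $\ell(\mathbf{0}, z) \le 1$ for all $z \in Z$. For any $\epsilon \in (0, 1)$
let $m \ge \frac{150 \beta B^2}{\epsilon^2}$ and set $\lambda = \epsilon / (3B^2)$. Then, for every distribution $\mathcal{D}$,

$$\mathbb{E}_S[L_{\mathcal{D}}(A(S))] \le \min_{\mathbf{w} \in \mathcal{H}} L_{\mathcal{D}}(\mathbf{w}) + \epsilon.$$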


13.5 SUMMARY


We introduced stability and showed that if an algorithm is stable then it does not 
overfit. Furthermore, for convex-Lipschitz-bounded or convex-smooth-bounded 
problems, the RLM rule with Tikhonov regularization leads to a stable learning
algorithm. We discussed how the regularization parameter, $\lambda$, controls the
tradeoff between fitting and overfitting. Finally, we have shown that all learning 
problems that are from the families of convex-Lipschitz-bounded and convex- 
smooth-bounded problems are learnable using the RLM rule. The RLM paradigm 
is the basis for many popular learning algorithms, including ridge regression (which 
we discussed in this chapter) and support vector machines (which will be discussed 
in Chapter 15). 

In the next chapter we will present Stochastic Gradient Descent, which gives 
us a very practical alternative way to learn convex-Lipschitz-bounded and convex- 
smooth-bounded problems and can also be used for efficiently implementing the 
RLM rule. 


13.6 BIBLIOGRAPHIC REMARKS 


Stability is widely used in many mathematical contexts. For example, the necessity 
of stability for so-called inverse problems to be well posed was first recognized by 
Hadamard (1902). The idea of regularization and its relation to stability became 
widely known through the works of Tikhonov (1943) and Phillips (1962). In the 
context of modern learning theory, the use of stability can be traced back at least to 
the work of Rogers and Wagner (1978), which noted that the sensitivity of a learning
algorithm with regard to small changes in the sample controls the variance of the 
leave-one-out estimate. The authors used this observation to obtain generalization 
bounds for the k-nearest neighbor algorithm (see Chapter 19). These results were 
later extended to other “local” learning algorithms (see Devroye, Györfi & Lugosi
(1996) and references therein). In addition, practical methods have been developed
to introduce stability into learning algorithms, in particular the Bagging technique
introduced by Breiman (1996).

Over the last decade, stability was studied as a generic condition for learnability. 
See (Kearns & Ron 1999, Bousquet & Elisseeff 2002, Kutin & Niyogi 2002, Rakhlin, 
Mukherjee & Poggio 2005, Mukherjee, Niyogi, Poggio & Rifkin 2006). Our presentation
follows the work of Shalev-Shwartz, Shamir, Srebro, and Sridharan (2010),
who showed that stability is sufficient and necessary for learning. They have also 
shown that all convex-Lipschitz-bounded learning problems are learnable using 
RLM, even though for some convex-Lipschitz-bounded learning problems uniform 
convergence does not hold in a strong sense. 


13.7 EXERCISES 


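13.1 From Bounded Expected Risk to Agnostic PAC Learning: Let A be an algorithm
that guarantees the following: If $m \ge m_{\mathcal{H}}(\epsilon)$ then for every distribution $\mathcal{D}$ it holds
that

$$\mathbb{E}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(A(S))] \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon.$$

■ Show that for every $\delta \in (0, 1)$, if $m \ge m_{\mathcal{H}}(\epsilon \delta)$ then with probability of at least
$1 - \delta$ it holds that $L_{\mathcal{D}}(A(S)) \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon$.
Hint: Observe that the random variable $L_{\mathcal{D}}(A(S)) - \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)$ is nonnegative
and rely on Markov’s inequality.

■ For every $\delta \in (0, 1)$ let

$$m_{\mathcal{H}}(\epsilon, \delta) = m_{\mathcal{H}}(\epsilon/2) \lceil \log_2(1/\delta) \rceil + \left\lceil \frac{\log(4/\delta) + \log(\lceil \log_2(1/\delta) \rceil)}{\epsilon^2} \right\rceil.$$

Suggest a procedure that agnostic PAC learns the problem with sample complexity
of $m_{\mathcal{H}}(\epsilon, \delta)$, assuming that the loss function is bounded by 1.
Hint: Let $k = \lceil \log_2(1/\delta) \rceil$. Divide the data into $k + 1$ chunks, where each of the
first k chunks is of size $m_{\mathcal{H}}(\epsilon/2)$ examples. Train on each of the first k chunks using A. On
the basis of the previous question argue that the probability that for all of these
chunks we have $L_{\mathcal{D}}(A(S)) > \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon$ is at most $2^{-k} \le \delta/2$. Finally, use
the last chunk as a validation set.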

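13.2 Learnability without Uniform Convergence: Let $\mathcal{B}$ be the unit ball of $\mathbb{R}^d$, let $\mathcal{H} = \mathcal{B}$,
let $Z = \mathcal{B} \times \{0, 1\}^d$, and let $\ell : Z \times \mathcal{H} \to \mathbb{R}$ be defined as follows:

$$\ell(\mathbf{w}, (\mathbf{x}, \boldsymbol{\alpha})) = \sum_{i=1}^{d} \alpha_i (x_i - w_i)^2.$$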


This problem corresponds to an unsupervised learning task, meaning that we do
not try to predict the label of $\mathbf{x}$. Instead, what we try to do is to find the “center of
mass” of the distribution over $\mathcal{B}$. However, there is a twist, modeled by the vectors
$\boldsymbol{\alpha}$. Each example is a pair $(\mathbf{x}, \boldsymbol{\alpha})$, where $\mathbf{x}$ is the instance and $\boldsymbol{\alpha}$ indicates which
features of $\mathbf{x}$ are “active” and which are “turned off.” A hypothesis is a vector
$\mathbf{w}$ representing the center of mass of the distribution, and the loss function is the
squared Euclidean distance between $\mathbf{x}$ and $\mathbf{w}$, but only with respect to the “active”
elements of $\mathbf{x}$.

■ Show that this problem is learnable using the RLM rule with a sample
complexity that does not depend on d.

■ Consider a distribution $\mathcal{D}$ over Z as follows: $\mathbf{x}$ is fixed to be some $\mathbf{x}_0$, and each
element of $\boldsymbol{\alpha}$ is sampled to be either 1 or 0 with equal probability. Show that
the rate of uniform convergence of this problem grows with d.
Hint: Let m be a training set size. Show that if $d \gg 2^m$, then there is a high
probability of sampling a set of examples such that there exists some $j \in [d]$ for
which $\alpha_j = 1$ for all the examples in the training set. Show that such a sample
cannot be $\epsilon$-representative. Conclude that the sample complexity of uniform
convergence must grow with $\log(d)$.

■ Conclude that if we take d to infinity we obtain a problem that is learnable but
for which the uniform convergence property does not hold. Compare to the
fundamental theorem of statistical learning.


13.3 Stability and Asymptotic ERM Are Sufficient for Learnability: 
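We say that a learning rule A is an AERM (Asymptotic Empirical Risk Minimizer)
with rate $\epsilon(m)$ if for every distribution $\mathcal{D}$ it holds that

$$\mathbb{E}_{S \sim \mathcal{D}^m} \left[ L_S(A(S)) - \min_{h \in \mathcal{H}} L_S(h) \right] \le \epsilon(m).$$

We say that a learning rule A learns a class $\mathcal{H}$ with rate $\epsilon(m)$ if for every distribution
$\mathcal{D}$ it holds that

$$\mathbb{E}_{S \sim \mathcal{D}^m} \left[ L_{\mathcal{D}}(A(S)) - \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) \right] \le \epsilon(m).$$

Prove the following:


Theorem 13.12. If a learning algorithm A is on-average-replace-one-stable with rate
$\epsilon_1(m)$ and is an AERM with rate $\epsilon_2(m)$, then it learns $\mathcal{H}$ with rate $\epsilon_1(m) + \epsilon_2(m)$.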


13.4 Strong Convexity with Respect to General Norms: 
Throughout the section we used the $\ell_2$ norm. In this exercise we generalize some
of the results to general norms. Let $\|\cdot\|$ be some arbitrary norm, and let f be a
strongly convex function with respect to this norm (see Definition 13.4).

1. Show that items 2-3 of Lemma 13.5 hold for every norm.

2. (*) Give an example of a norm for which item 1 of Lemma 13.5 does not hold.

3. Let $R(\mathbf{w})$ be a function that is $(2\lambda)$-strongly convex with respect to some norm
$\|\cdot\|$. Let A be an RLM rule with respect to R, namely,

$$A(S) = \operatorname*{argmin}_{\mathbf{w}} \left( L_S(\mathbf{w}) + R(\mathbf{w}) \right).$$

Assume that for every z, the loss function $\ell(\cdot, z)$ is $\rho$-Lipschitz with respect to
the same norm, namely,

$$\forall z, \; \forall \mathbf{w}, \mathbf{v}, \quad \ell(\mathbf{w}, z) - \ell(\mathbf{v}, z) \le \rho \|\mathbf{w} - \mathbf{v}\|.$$

Prove that A is on-average-replace-one-stable with rate $\frac{2\rho^2}{\lambda m}$.

4. (*) Let $q \in (1, 2)$ and consider the $\ell_q$-norm

$$\|\mathbf{w}\|_q = \left( \sum_{i=1}^{d} |w_i|^q \right)^{1/q}.$$

It can be shown (see, for example, Shalev-Shwartz (2007)) that the function

$$R(\mathbf{w}) = \frac{1}{2(q - 1)} \|\mathbf{w}\|_q^2$$

is 1-strongly convex with respect to $\|\mathbf{w}\|_q$. Show that if $q = \frac{\log(d)}{\log(d) - 1}$ then $R(\mathbf{w})$ is
$\left( \frac{1}{3 \log(d)} \right)$-strongly convex with respect to the $\ell_1$ norm over $\mathbb{R}^d$.


Recall that the goal of learning is to minimize the risk function, $L_{\mathcal{D}}(h) =
\mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)]$. We cannot directly minimize the risk function since it depends on
the unknown distribution D. So far in the book, we have discussed learning meth- 
ods that depend on the empirical risk. That is, we first sample a training set S and 
define the empirical risk function Ls(h). Then, the learner picks a hypothesis based 
on the value of Ls(h). For example, the ERM rule tells us to pick the hypothesis 
that minimizes Ls(h) over the hypothesis class, H. Or, in the previous chapter, we 
discussed regularized risk minimization, in which we pick a hypothesis that jointly 
minimizes Ls(h) and a regularization function over h. 

In this chapter we describe and analyze a rather different learning approach,
which is called Stochastic Gradient Descent (SGD). As in Chapter 12 we will focus
on the important family of convex learning problems, and following the notation
in that chapter, we will refer to hypotheses as vectors $w$ that come from a convex
hypothesis class, $\mathcal{H}$. In SGD, we try to minimize the risk function $L_{\mathcal{D}}(w)$ directly
using a gradient descent procedure. Gradient descent is an iterative optimization
procedure in which at each step we improve the solution by taking a step along the
negative of the gradient of the function to be minimized at the current point. Of
course, in our case, we are minimizing the risk function, and since we do not know
$\mathcal{D}$ we also do not know the gradient of $L_{\mathcal{D}}(w)$. SGD circumvents this problem by
allowing the optimization procedure to take a step along a random direction, as
long as the expected value of the direction is the negative of the gradient. And, as
we shall see, finding a random direction whose expected value corresponds to the
gradient is rather simple even though we do not know the underlying distribution $\mathcal{D}$.

The advantage of SGD, in the context of convex learning problems, over the 
regularized risk minimization learning rule is that SGD is an efficient algorithm that 
can be implemented in a few lines of code, yet still enjoys the same sample complex- 
ity as the regularized risk minimization rule. The simplicity of SGD also allows us 
to use it in situations when it is not possible to apply methods that are based on the 
empirical risk, but this is beyond the scope of this book. 

We start this chapter with the basic gradient descent algorithm and analyze its 
convergence rate for convex-Lipschitz functions. Next, we introduce the notion of
subgradient and show that gradient descent can be applied for nondifferentiable
functions as well. The core of this chapter is Section 14.3, in which we describe 
the Stochastic Gradient Descent algorithm, along with several useful variants. We 
show that SGD enjoys an expected convergence rate similar to the rate of gradient 
descent. Finally, we turn to the applicability of SGD to learning problems. 


14.1 GRADIENT DESCENT 


Before we describe the stochastic gradient descent method, we would like to 
describe the standard gradient descent approach for minimizing a differentiable 
convex function f(w). 

The gradient of a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ at $w$, denoted $\nabla f(w)$, is
the vector of partial derivatives of $f$, namely,
$\nabla f(w) = \left( \frac{\partial f(w)}{\partial w[1]}, \ldots, \frac{\partial f(w)}{\partial w[d]} \right)$. Gradient
descent is an iterative algorithm. We start with an initial value of $w$ (say, $w^{(1)} = 0$).
Then, at each iteration, we take a step in the direction of the negative of the gradient
at the current point. That is, the update step is

\[ w^{(t+1)} = w^{(t)} - \eta \nabla f(w^{(t)}), \tag{14.1} \]


where $\eta > 0$ is a parameter to be discussed later. Intuitively, since the gradient
points in the direction of the greatest rate of increase of $f$ around $w^{(t)}$, the algorithm
makes a small step in the opposite direction, thus decreasing the value of the
function. Eventually, after $T$ iterations, the algorithm outputs the averaged vector,
$\bar{w} = \frac{1}{T} \sum_{t=1}^{T} w^{(t)}$. The output could also be the last vector, $w^{(T)}$, or the best performing
vector, $\operatorname{argmin}_{t \in [T]} f(w^{(t)})$, but taking the average turns out to be rather useful,
especially when we generalize gradient descent to nondifferentiable functions and
to the stochastic case.
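
As a rough illustration of the update rule in Equation (14.1) and of the averaged output, the following Python sketch runs gradient descent on the function used in Figure 14.1. The step size 0.1, the horizon T = 200, and the helper names are assumptions made for this demonstration only; they are not prescribed by the text.

```python
def f(w):
    """Objective from Figure 14.1: 1.25*(x1 + 6)^2 + (x2 - 8)^2."""
    x1, x2 = w
    return 1.25 * (x1 + 6.0) ** 2 + (x2 - 8.0) ** 2


def grad_f(w):
    """Gradient of f, computed by hand from the formula above."""
    x1, x2 = w
    return [2.5 * (x1 + 6.0), 2.0 * (x2 - 8.0)]


def gradient_descent(grad, dim, eta, T):
    """Run T steps of w <- w - eta * grad(w) starting from w = 0.

    Returns (average of w^(1..T), last iterate w^(T+1)).
    """
    w = [0.0] * dim
    total = [0.0] * dim
    for _ in range(T):
        total = [s + wi for s, wi in zip(total, w)]  # accumulate w^(t) before the update
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    w_bar = [s / T for s in total]
    return w_bar, w


w_bar, w_last = gradient_descent(grad_f, dim=2, eta=0.1, T=200)
print("averaged output:", w_bar, "f =", f(w_bar))
print("last iterate:   ", w_last, "f =", f(w_last))
```

On this smooth objective the last iterate is closer to the minimizer (-6, 8) than the average, because the average still carries the early iterates; the average is analyzed here because it extends to the nondifferentiable and stochastic settings.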

Another way to motivate gradient descent is by relying on Taylor approximation.
The gradient of $f$ at $w$ yields the first order Taylor approximation of $f$ around
$w$ by $f(u) \approx f(w) + \langle u - w, \nabla f(w) \rangle$. When $f$ is convex, this approximation lower
bounds $f$, that is,

\[ f(u) \ge f(w) + \langle u - w, \nabla f(w) \rangle. \]

Therefore, for $w$ close to $w^{(t)}$ we have that $f(w) \approx f(w^{(t)}) + \langle w - w^{(t)}, \nabla f(w^{(t)}) \rangle$.
Hence we can minimize the approximation of $f(w)$. However, the approximation
might become loose for $w$ that is far away from $w^{(t)}$. Therefore, we would like to
minimize jointly the distance between $w$ and $w^{(t)}$ and the approximation of $f$ around
$w^{(t)}$. If the parameter $\eta$ controls the tradeoff between the two terms, we obtain the
update rule

\[ w^{(t+1)} = \operatorname*{argmin}_{w} \; \frac{1}{2} \|w - w^{(t)}\|^2 + \eta \left( f(w^{(t)}) + \langle w - w^{(t)}, \nabla f(w^{(t)}) \rangle \right). \]

Solving the preceding by taking the derivative with respect to $w$ and comparing it to
zero yields the same update rule as in Equation (14.1).
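
For completeness, the differentiation step mentioned above can be written out as follows. This is a short worked derivation using only the objective already stated; the term $f(w^{(t)})$ is constant in $w$ and drops out.

```latex
\nabla_w \left[ \tfrac{1}{2}\,\|w - w^{(t)}\|^2
  + \eta \left( f(w^{(t)}) + \langle w - w^{(t)}, \nabla f(w^{(t)}) \rangle \right) \right]
  = \left( w - w^{(t)} \right) + \eta\, \nabla f(w^{(t)}) = 0
\quad \Longrightarrow \quad
w = w^{(t)} - \eta\, \nabla f(w^{(t)}).
```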
Figure 14.1. An illustration of the gradient descent algorithm. The function to be
minimized is $1.25(x_1 + 6)^2 + (x_2 - 8)^2$.


14.1.1 Analysis of GD for Convex-Lipschitz Functions 


To analyze the convergence rate of the GD algorithm, we limit ourselves to the case 
of convex-Lipschitz functions (as we have seen, many problems lend themselves 
easily to this setting). Let w* be any vector and let B be an upper bound on ||w*||. It 
is convenient to think of w* as the minimizer of f(w), but the analysis that follows 
holds for every w*. 

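We would like to obtain an upper bound on the suboptimality of our solution
with respect to $w^*$, namely, $f(\bar{w}) - f(w^*)$, where $\bar{w} = \frac{1}{T} \sum_{t=1}^{T} w^{(t)}$. From the
definition of $\bar{w}$, and using Jensen's inequality, we have that

\[
\begin{aligned}
f(\bar{w}) - f(w^*) &= f\Bigl( \frac{1}{T} \sum_{t=1}^{T} w^{(t)} \Bigr) - f(w^*) \\
&\le \frac{1}{T} \sum_{t=1}^{T} f(w^{(t)}) - f(w^*) \\
&= \frac{1}{T} \sum_{t=1}^{T} \bigl( f(w^{(t)}) - f(w^*) \bigr).
\end{aligned}
\tag{14.2}
\]

For every $t$, because of the convexity of $f$, we have that

\[ f(w^{(t)}) - f(w^*) \le \langle w^{(t)} - w^*, \nabla f(w^{(t)}) \rangle. \tag{14.3} \]

Combining the preceding we obtain

\[ f(\bar{w}) - f(w^*) \le \frac{1}{T} \sum_{t=1}^{T} \langle w^{(t)} - w^*, \nabla f(w^{(t)}) \rangle. \]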

To bound the right-hand side we rely on the following lemma: 


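Lemma 14.1. Let $v_1, \ldots, v_T$ be an arbitrary sequence of vectors. Any algorithm with
an initialization $w^{(1)} = 0$ and an update rule of the form

\[ w^{(t+1)} = w^{(t)} - \eta v_t \tag{14.4} \]

satisfies

\[ \sum_{t=1}^{T} \langle w^{(t)} - w^*, v_t \rangle \le \frac{\|w^*\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2. \tag{14.5} \]

In particular, for every $B, \rho > 0$, if for all $t$ we have that $\|v_t\| \le \rho$ and if we set
$\eta = \sqrt{\frac{B^2}{\rho^2 T}}$, then for every $w^*$ with $\|w^*\| \le B$ we have

\[ \frac{1}{T} \sum_{t=1}^{T} \langle w^{(t)} - w^*, v_t \rangle \le \frac{B \rho}{\sqrt{T}}. \]

Proof. Using algebraic manipulations (completing the square), we obtain

\[
\begin{aligned}
\langle w^{(t)} - w^*, v_t \rangle &= \frac{1}{\eta} \langle w^{(t)} - w^*, \eta v_t \rangle \\
&= \frac{1}{2\eta} \left( -\|w^{(t)} - w^* - \eta v_t\|^2 + \|w^{(t)} - w^*\|^2 + \eta^2 \|v_t\|^2 \right) \\
&= \frac{1}{2\eta} \left( -\|w^{(t+1)} - w^*\|^2 + \|w^{(t)} - w^*\|^2 \right) + \frac{\eta}{2} \|v_t\|^2,
\end{aligned}
\]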


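where the last equality follows from the definition of the update rule. Summing the
equality over $t$, we have

\[ \sum_{t=1}^{T} \langle w^{(t)} - w^*, v_t \rangle = \frac{1}{2\eta} \sum_{t=1}^{T} \left( -\|w^{(t+1)} - w^*\|^2 + \|w^{(t)} - w^*\|^2 \right) + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2. \tag{14.6} \]

The first sum on the right-hand side is a telescopic sum that collapses to

\[ \|w^{(1)} - w^*\|^2 - \|w^{(T+1)} - w^*\|^2. \]

Plugging this in Equation (14.6), we have

\[
\begin{aligned}
\sum_{t=1}^{T} \langle w^{(t)} - w^*, v_t \rangle &= \frac{1}{2\eta} \left( \|w^{(1)} - w^*\|^2 - \|w^{(T+1)} - w^*\|^2 \right) + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2 \\
&\le \frac{1}{2\eta} \|w^{(1)} - w^*\|^2 + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2 \\
&= \frac{1}{2\eta} \|w^*\|^2 + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2,
\end{aligned}
\]

where the last equality is due to the definition $w^{(1)} = 0$. This proves the first part of
the lemma (Equation (14.5)). The second part follows by upper bounding $\|w^*\|$ by
$B$, $\|v_t\|$ by $\rho$, dividing by $T$, and plugging in the value of $\eta$. □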
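
Because Lemma 14.1 holds for an arbitrary sequence of vectors, it can be sanity-checked numerically. The sketch below draws random vectors, runs the update of Equation (14.4) from zero, and evaluates both sides of Equation (14.5). The dimension, sequence length, step size, and function name are assumptions chosen for illustration.

```python
import random


def lemma_14_1_sides(vs, w_star, eta):
    """Return (lhs, rhs) of Equation (14.5) for w^(1) = 0 and w <- w - eta * v_t."""
    w = [0.0] * len(w_star)
    lhs = 0.0
    for v in vs:
        # <w^(t) - w*, v_t>, evaluated before the update
        lhs += sum((wi - si) * vi for wi, si, vi in zip(w, w_star, v))
        w = [wi - eta * vi for wi, vi in zip(w, v)]
    sq_norm_w_star = sum(s * s for s in w_star)
    sum_sq_norms_v = sum(sum(x * x for x in v) for v in vs)
    rhs = sq_norm_w_star / (2.0 * eta) + (eta / 2.0) * sum_sq_norms_v
    return lhs, rhs


rng = random.Random(0)
vs = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(50)]
w_star = [rng.uniform(-1.0, 1.0) for _ in range(3)]
lhs, rhs = lemma_14_1_sides(vs, w_star, eta=0.1)
print(lhs, "<=", rhs)
```

By the proof, the gap between the two sides equals the discarded term, the squared distance of the final iterate from $w^*$ divided by $2\eta$, so the inequality is tight only when the final iterate lands exactly on $w^*$.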


Lemma 14.1 applies to the GD algorithm with $v_t = \nabla f(w^{(t)})$. As we will show
later in Lemma 14.7, if $f$ is $\rho$-Lipschitz, then $\|\nabla f(w^{(t)})\| \le \rho$. We therefore satisfy
the lemma's conditions and achieve the following corollary:

Corollary 14.2. Let $f$ be a convex, $\rho$-Lipschitz function, and let $w^* \in
\operatorname{argmin}_{\{w : \|w\| \le B\}} f(w)$. If we run the GD algorithm on $f$ for $T$ steps with $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$,
then the output vector $\bar{w}$ satisfies

\[ f(\bar{w}) - f(w^*) \le \frac{B \rho}{\sqrt{T}}. \]

Furthermore, for every $\epsilon > 0$, to achieve $f(\bar{w}) - f(w^*) \le \epsilon$, it suffices to run the GD
algorithm for a number of iterations that satisfies

\[ T \ge \frac{B^2 \rho^2}{\epsilon^2}. \]
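
The corollary translates directly into a recipe for choosing the number of iterations and the step size from $B$, $\rho$, and the target accuracy $\epsilon$. The helper below is a small sketch of that arithmetic; the function name and the example values are hypothetical.

```python
import math


def gd_schedule(B, rho, eps):
    """Smallest integer T with T >= (B*rho/eps)^2, and the matching step size.

    eta = sqrt(B^2 / (rho^2 * T)) = B / (rho * sqrt(T)), as in Corollary 14.2.
    """
    T = math.ceil((B * rho / eps) ** 2)
    eta = B / (rho * math.sqrt(T))
    return T, eta


T, eta = gd_schedule(B=2.0, rho=3.0, eps=0.05)
guarantee = 2.0 * 3.0 / math.sqrt(T)  # the bound B*rho/sqrt(T) for this T
print(T, eta, guarantee)
```

Halving $\epsilon$ quadruples the required number of iterations, which is the $1/\epsilon^2$ dependence stated in the corollary.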


14.2 SUBGRADIENTS 


The GD algorithm requires that the function f be differentiable. We now generalize 
the discussion beyond differentiable functions. We will show that the GD algorithm 
can be applied to nondifferentiable functions by using a so-called subgradient of 
f (w) at w, instead of the gradient. 

To motivate the definition of subgradients, recall that for a convex function $f$,
the gradient at $w$ defines the slope of a tangent that lies below $f$, that is,

\[ \forall u, \quad f(u) \ge f(w) + \langle u - w, \nabla f(w) \rangle. \tag{14.7} \]

An illustration is given on the left-hand side of Figure 14.2.
The existence of a tangent that lies below $f$ is an important property of convex
functions, which is in fact an alternative characterization of convexity.

Lemma 14.3. Let $S$ be an open convex set. A function $f : S \to \mathbb{R}$ is convex iff for
every $w \in S$ there exists $v$ such that

\[ \forall u \in S, \quad f(u) \ge f(w) + \langle u - w, v \rangle. \tag{14.8} \]

The proof of this lemma can be found in many convex analysis textbooks (e.g.,
(Borwein & Lewis 2006)). The preceding inequality leads us to the definition of
subgradients.

Definition 14.4. (Subgradients). A vector $v$ that satisfies Equation (14.8) is called a
subgradient of $f$ at $w$. The set of subgradients of $f$ at $w$ is called the differential set
and denoted $\partial f(w)$.


An illustration of subgradients is given on the right-hand side of Figure 14.2. For 
scalar functions, a subgradient of a convex function f at w is a slope of a line that 
touches f at w and is not above f elsewhere. 


14.2.1 Calculating Subgradients 


How do we construct subgradients of a given convex function? If a function is dif- 
ferentiable at a point w, then the differential set is trivial, as the following claim 
shows. 
Figure 14.2. Left: The right-hand side of Equation (14.7) is the tangent of $f$ at $w$. For a
convex function, the tangent lower bounds $f$. Right: Illustration of several subgradients
of a nondifferentiable convex function.


Claim 14.5. If $f$ is differentiable at $w$ then $\partial f(w)$ contains a single element: the
gradient of $f$ at $w$, $\nabla f(w)$.


Example 14.1 (The Differential Set of the Absolute Function). Consider the absolute
value function $f(x) = |x|$. Using Claim 14.5, we can easily construct the
differential set for the differentiable parts of $f$, and the only point that requires
special attention is $x_0 = 0$. At that point, it is easy to verify that the subdifferential is
the set of all numbers between $-1$ and $1$. Hence:

\[
\partial f(x) =
\begin{cases}
\{1\} & \text{if } x > 0 \\
\{-1\} & \text{if } x < 0 \\
[-1, 1] & \text{if } x = 0
\end{cases}
\]


For many practical uses, we do not need to calculate the whole set of subgradi- 
ents at a given point, as one member of this set would suffice. The following claim 
shows how to construct a sub-gradient for pointwise maximum functions. 


Claim 14.6. Let $g(w) = \max_{i \in [r]} g_i(w)$ for $r$ convex differentiable functions $g_1, \ldots, g_r$.
Given some $w$, let $j \in \operatorname{argmax}_i g_i(w)$. Then $\nabla g_j(w) \in \partial g(w)$.


Proof. Since $g_j$ is convex we have that for all $u$

\[ g_j(u) \ge g_j(w) + \langle u - w, \nabla g_j(w) \rangle. \]

Since $g(w) = g_j(w)$ and $g(u) \ge g_j(u)$ we obtain that

\[ g(u) \ge g(w) + \langle u - w, \nabla g_j(w) \rangle, \]

which concludes our proof. □


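Example 14.2. (A Subgradient of the Hinge Loss). Recall the hinge loss function
from Section 12.3, $f(w) = \max\{0, 1 - y \langle w, x \rangle\}$ for some vector $x$ and scalar $y$. To
calculate a subgradient of the hinge loss at some $w$ we rely on the preceding claim
and obtain that the vector $v$ defined in the following is a subgradient of the hinge
loss at $w$:

\[
v =
\begin{cases}
0 & \text{if } 1 - y \langle w, x \rangle \le 0 \\
-y x & \text{if } 1 - y \langle w, x \rangle > 0
\end{cases}
\]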
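
Applying Claim 14.6 to the hinge loss, with $g_1(w) = 0$ and $g_2(w) = 1 - y\langle w, x\rangle$, amounts to returning the gradient of whichever piece attains the maximum. The Python sketch below does this and then checks the subgradient inequality (14.8) at random points; the function names and the test data are assumptions made for illustration.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def hinge(w, x, y):
    """Hinge loss max{0, 1 - y<w, x>}."""
    return max(0.0, 1.0 - y * dot(w, x))


def hinge_subgradient(w, x, y):
    """One subgradient of the hinge loss at w, chosen as in Claim 14.6."""
    if 1.0 - y * dot(w, x) <= 0.0:
        return [0.0] * len(w)       # the constant piece g_1 = 0 is a maximizer
    return [-y * xi for xi in x]    # gradient of the linear piece g_2


rng = random.Random(0)
x, y = [1.0, -2.0, 0.5], -1.0
violations = 0
for _ in range(1000):
    w = [rng.uniform(-2.0, 2.0) for _ in range(3)]
    u = [rng.uniform(-2.0, 2.0) for _ in range(3)]
    v = hinge_subgradient(w, x, y)
    lower_bound = hinge(w, x, y) + dot([ui - wi for ui, wi in zip(u, w)], v)
    if hinge(u, x, y) < lower_bound - 1e-12:
        violations += 1
print("violations of inequality (14.8):", violations)
```

At the kink, where $1 - y\langle w, x\rangle = 0$, both pieces attain the maximum, so either $0$ or $-yx$ would be a valid choice; the sketch returns $0$ there.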
14.2.2 Subgradients of Lipschitz Functions


Recall that a function $f : A \to \mathbb{R}$ is $\rho$-Lipschitz if for all $u, v \in A$

\[ |f(u) - f(v)| \le \rho \, \|u - v\|. \]

The following lemma gives an equivalent definition using norms of subgradients.


Lemma 14.7. Let $A$ be a convex open set and let $f : A \to \mathbb{R}$ be a convex function.
Then, $f$ is $\rho$-Lipschitz over $A$ iff for all $w \in A$ and $v \in \partial f(w)$ we have that $\|v\| \le \rho$.


Proof. Assume that for all $v \in \partial f(w)$ we have that $\|v\| \le \rho$. Since $v \in \partial f(w)$ we have

\[ f(w) - f(u) \le \langle v, w - u \rangle. \]

Bounding the right-hand side using the Cauchy-Schwarz inequality we obtain

\[ f(w) - f(u) \le \langle v, w - u \rangle \le \|v\| \, \|w - u\| \le \rho \, \|w - u\|. \]

An analogous argument can show that $f(u) - f(w) \le \rho \, \|w - u\|$. Hence $f$ is
$\rho$-Lipschitz.

Now assume that $f$ is $\rho$-Lipschitz. Choose some $w \in A$, $v \in \partial f(w)$. Since $A$ is
open, there exists $\epsilon > 0$ such that $u = w + \epsilon v / \|v\|$ belongs to $A$. Therefore, $\langle u - w, v \rangle =
\epsilon \|v\|$ and $\|u - w\| = \epsilon$. From the definition of the subgradient,

\[ f(u) - f(w) \ge \langle v, u - w \rangle = \epsilon \|v\|. \]

On the other hand, from the Lipschitzness of $f$ we have

\[ \rho \epsilon = \rho \|u - w\| \ge f(u) - f(w). \]

Combining the two inequalities we conclude that $\|v\| \le \rho$. □


14.2.3 Subgradient Descent 


The gradient descent algorithm can be generalized to nondifferentiable functions 
by using a subgradient of f(w) at w, instead of the gradient. The analysis of the 
convergence rate remains unchanged: Simply note that Equation (14.3) is true for 
subgradients as well. 


14.3 STOCHASTIC GRADIENT DESCENT (SGD) 


In stochastic gradient descent we do not require the update direction to be based 
exactly on the gradient. Instead, we allow the direction to be a random vector and 
only require that its expected value at each iteration will equal the gradient direction. 
Or, more generally, we require that the expected value of the random vector will be 
a subgradient of the function at the current vector. 
Figure 14.3. An illustration of the gradient descent algorithm (left) and the stochastic
gradient descent algorithm (right). The function to be minimized is $1.25(x + 6)^2 + (y - 8)^2$.
For the stochastic case, the solid line depicts the averaged value of $w$.


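Stochastic Gradient Descent (SGD) for minimizing $f(w)$

parameters: Scalar $\eta > 0$, integer $T > 0$
initialize: $w^{(1)} = 0$
for $t = 1, 2, \ldots, T$
    choose $v_t$ at random from a distribution such that $\mathbb{E}[v_t \mid w^{(t)}] \in \partial f(w^{(t)})$
    update $w^{(t+1)} = w^{(t)} - \eta v_t$
output $\bar{w} = \frac{1}{T} \sum_{t=1}^{T} w^{(t)}$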
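
The box above can be sketched in a few lines of Python. In this illustration the random direction is the true gradient of the Figure 14.3 objective plus zero-mean Gaussian noise, which is one hypothetical way to satisfy the requirement that the conditional expectation is a (sub)gradient. Note that Gaussian noise is unbounded, so this toy example does not meet the bounded-norm assumption used in the analysis that follows; the step size, horizon, and noise level are likewise arbitrary choices.

```python
import random


def sgd(noisy_grad, dim, eta, T, rng):
    """SGD as in the box: start at 0, step along -eta * v_t, return the average iterate."""
    w = [0.0] * dim
    total = [0.0] * dim
    for _ in range(T):
        total = [s + wi for s, wi in zip(total, w)]  # accumulate w^(t)
        v = noisy_grad(w, rng)                       # E[v | w] is the gradient at w
        w = [wi - eta * vi for wi, vi in zip(w, v)]
    return [s / T for s in total]


def noisy_grad(w, rng):
    """Gradient of 1.25(x + 6)^2 + (y - 8)^2 plus zero-mean noise (an assumed noise model)."""
    x, y = w
    return [2.5 * (x + 6.0) + rng.gauss(0.0, 1.0),
            2.0 * (y - 8.0) + rng.gauss(0.0, 1.0)]


w_bar = sgd(noisy_grad, dim=2, eta=0.05, T=2000, rng=random.Random(0))
print("averaged SGD output:", w_bar)  # should be close to the minimizer (-6, 8)
```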


An illustration of stochastic gradient descent versus gradient descent is given 
in Figure 14.3. As we will see in Section 14.5, in the context of learning problems, 
it is easy to find a random vector whose expectation is a subgradient of the risk 
function. 


14.3.1 Analysis of SGD for Convex-Lipschitz-Bounded Functions


Recall the bound we achieved for the GD algorithm in Corollary 14.2. For the
stochastic case, in which only the expectation of $v_t$ is in $\partial f(w^{(t)})$, we cannot directly
apply Equation (14.3). However, since the expected value of $v_t$ is a subgradient of
$f$ at $w^{(t)}$, we can still derive a similar bound on the expected output of stochastic
gradient descent. This is formalized in the following theorem.


Theorem 14.8. Let $B, \rho > 0$. Let $f$ be a convex function and let $w^* \in
\operatorname{argmin}_{\{w : \|w\| \le B\}} f(w)$. Assume that SGD is run for $T$ iterations with $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$.
Assume also that for all $t$, $\|v_t\| \le \rho$ with probability 1. Then,

\[ \mathbb{E}[f(\bar{w})] - f(w^*) \le \frac{B \rho}{\sqrt{T}}. \]

Therefore, for any $\epsilon > 0$, to achieve $\mathbb{E}[f(\bar{w})] - f(w^*) \le \epsilon$, it suffices to run the SGD
algorithm for a number of iterations that satisfies

\[ T \ge \frac{B^2 \rho^2}{\epsilon^2}. \]


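Proof. Let us introduce the notation $v_{1:t}$ to denote the sequence $v_1, \ldots, v_t$. Taking
expectation of Equation (14.2), we obtain

\[ \mathbb{E}_{v_{1:T}} [f(\bar{w}) - f(w^*)] \le \mathbb{E}_{v_{1:T}} \Bigl[ \frac{1}{T} \sum_{t=1}^{T} \bigl( f(w^{(t)}) - f(w^*) \bigr) \Bigr]. \]

Since Lemma 14.1 holds for any sequence $v_1, v_2, \ldots, v_T$, it applies to SGD as well. By
taking expectation of the bound in the lemma we have

\[ \mathbb{E}_{v_{1:T}} \Bigl[ \frac{1}{T} \sum_{t=1}^{T} \langle w^{(t)} - w^*, v_t \rangle \Bigr] \le \frac{B \rho}{\sqrt{T}}. \tag{14.9} \]

It is left to show that

\[ \mathbb{E}_{v_{1:T}} \Bigl[ \frac{1}{T} \sum_{t=1}^{T} \bigl( f(w^{(t)}) - f(w^*) \bigr) \Bigr] \le \mathbb{E}_{v_{1:T}} \Bigl[ \frac{1}{T} \sum_{t=1}^{T} \langle w^{(t)} - w^*, v_t \rangle \Bigr], \tag{14.10} \]

which we will hereby prove.
Using the linearity of the expectation we have

\[ \mathbb{E}_{v_{1:T}} \Bigl[ \frac{1}{T} \sum_{t=1}^{T} \langle w^{(t)} - w^*, v_t \rangle \Bigr] = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{v_{1:T}} \bigl[ \langle w^{(t)} - w^*, v_t \rangle \bigr]. \]

Next, we recall the law of total expectation: For every two random variables $\alpha, \beta$,
and a function $g$, $\mathbb{E}_{\alpha}[g(\alpha)] = \mathbb{E}_{\beta} \mathbb{E}_{\alpha}[g(\alpha) \mid \beta]$. Setting $\alpha = v_{1:t}$ and $\beta = v_{1:t-1}$ we get
that

\[
\begin{aligned}
\mathbb{E}_{v_{1:T}} \bigl[ \langle w^{(t)} - w^*, v_t \rangle \bigr] &= \mathbb{E}_{v_{1:t}} \bigl[ \langle w^{(t)} - w^*, v_t \rangle \bigr] \\
&= \mathbb{E}_{v_{1:t-1}} \mathbb{E}_{v_{1:t}} \bigl[ \langle w^{(t)} - w^*, v_t \rangle \mid v_{1:t-1} \bigr].
\end{aligned}
\]

Once we know $v_{1:t-1}$, the value of $w^{(t)}$ is not random any more and therefore

\[ \mathbb{E}_{v_{1:t-1}} \mathbb{E}_{v_{1:t}} \bigl[ \langle w^{(t)} - w^*, v_t \rangle \mid v_{1:t-1} \bigr] = \mathbb{E}_{v_{1:t-1}} \bigl\langle w^{(t)} - w^*, \mathbb{E}_{v_t}[v_t \mid v_{1:t-1}] \bigr\rangle. \]

Since $w^{(t)}$ only depends on $v_{1:t-1}$ and SGD requires that $\mathbb{E}_{v_t}[v_t \mid w^{(t)}] \in \partial f(w^{(t)})$ we
obtain that $\mathbb{E}_{v_t}[v_t \mid v_{1:t-1}] \in \partial f(w^{(t)})$. Thus,

\[ \mathbb{E}_{v_{1:t-1}} \bigl\langle w^{(t)} - w^*, \mathbb{E}_{v_t}[v_t \mid v_{1:t-1}] \bigr\rangle \ge \mathbb{E}_{v_{1:t-1}} \bigl[ f(w^{(t)}) - f(w^*) \bigr]. \]

Overall, we have shown that

\[
\begin{aligned}
\mathbb{E}_{v_{1:T}} \bigl[ \langle w^{(t)} - w^*, v_t \rangle \bigr] &\ge \mathbb{E}_{v_{1:t-1}} \bigl[ f(w^{(t)}) - f(w^*) \bigr] \\
&= \mathbb{E}_{v_{1:T}} \bigl[ f(w^{(t)}) - f(w^*) \bigr].
\end{aligned}
\]

Summing over $t$, dividing by $T$, and using the linearity of expectation, we get that
Equation (14.10) holds, which concludes our proof. □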


14.4 VARIANTS 


In this section we describe several variants of Stochastic Gradient Descent. 


14.4.1 Adding a Projection Step 


In the previous analyses of the GD and SGD algorithms, we required that the norm
of $w^*$ will be at most $B$, which is equivalent to requiring that $w^*$ is in the set $\mathcal{H} =
\{w : \|w\| \le B\}$. In terms of learning, this means restricting ourselves to a $B$-bounded
hypothesis class. Yet any step we take in the opposite direction of the gradient (or
its expected direction) might result in stepping out of this bound, and there is even
no guarantee that $\bar{w}$ satisfies it. We show in the following how to overcome this
problem while maintaining the same convergence rate.

The basic idea is to add a projection step; namely, we will now have a two-step
update rule, where we first subtract a subgradient from the current value of $w$ and
then project the resulting vector onto $\mathcal{H}$. Formally,

1. $w^{(t+\frac{1}{2})} = w^{(t)} - \eta v_t$

2. $w^{(t+1)} = \operatorname{argmin}_{w \in \mathcal{H}} \|w - w^{(t+\frac{1}{2})}\|$

The projection step replaces the current value of $w$ by the vector in $\mathcal{H}$ closest to it.
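
For the ball of radius $B$ discussed here, the projection has a simple closed form: a point already inside the ball is left unchanged, and a point outside is rescaled to have norm $B$. The sketch below implements this together with the two-step update; the function names are hypothetical, and other convex sets would need their own projection routine.

```python
import math


def project_onto_ball(w, B):
    """Euclidean projection of w onto {u : ||u|| <= B}."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm <= B:
        return list(w)                      # already feasible
    return [wi * B / norm for wi in w]      # rescale onto the sphere of radius B


def projected_step(w, v, eta, B):
    """Two-step update: gradient-type step, then projection."""
    half = [wi - eta * vi for wi, vi in zip(w, v)]  # w^(t+1/2)
    return project_onto_ball(half, B)               # w^(t+1)


inside = project_onto_ball([0.3, 0.4], 1.0)
outside = project_onto_ball([3.0, 4.0], 1.0)
stepped = projected_step([0.0, 0.0], [-30.0, -40.0], eta=0.1, B=1.0)
print(inside, outside, stepped)
```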

Clearly, the projection step guarantees that $w^{(t)} \in \mathcal{H}$ for all $t$. Since $\mathcal{H}$ is convex
this also implies that $\bar{w} \in \mathcal{H}$ as required. We next show that the analysis of SGD with
projections remains the same. This is based on the following lemma.


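Lemma 14.9 (Projection Lemma). Let $\mathcal{H}$ be a closed convex set and let $v$ be the
projection of $w$ onto $\mathcal{H}$, namely,

\[ v = \operatorname*{argmin}_{x \in \mathcal{H}} \|x - w\|^2. \]

Then, for every $u \in \mathcal{H}$,

\[ \|w - u\|^2 - \|v - u\|^2 \ge 0. \]

Proof. By the convexity of $\mathcal{H}$, for every $\alpha \in (0, 1)$ we have that $v + \alpha(u - v) \in \mathcal{H}$.
Therefore, from the optimality of $v$ we obtain

\[
\begin{aligned}
\|v - w\|^2 &\le \|v + \alpha(u - v) - w\|^2 \\
&= \|v - w\|^2 + 2\alpha \langle v - w, u - v \rangle + \alpha^2 \|u - v\|^2.
\end{aligned}
\]

Rearranging, we obtain

\[ 2 \langle v - w, u - v \rangle \ge -\alpha \|u - v\|^2. \]

Taking the limit $\alpha \to 0$ we get that

\[ \langle v - w, u - v \rangle \ge 0. \]

Therefore,

\[
\begin{aligned}
\|w - u\|^2 &= \|w - v + v - u\|^2 \\
&= \|w - v\|^2 + \|v - u\|^2 + 2 \langle v - w, u - v \rangle \\
&\ge \|v - u\|^2.
\end{aligned}
\]
□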


Equipped with the preceding lemma, we can easily adapt the analysis of SGD to
the case in which we add projection steps on a closed and convex set. Simply note
that for every $t$,

\[
\begin{aligned}
&\|w^{(t+1)} - w^*\|^2 - \|w^{(t)} - w^*\|^2 \\
&\quad = \|w^{(t+1)} - w^*\|^2 - \|w^{(t+\frac{1}{2})} - w^*\|^2 + \|w^{(t+\frac{1}{2})} - w^*\|^2 - \|w^{(t)} - w^*\|^2 \\
&\quad \le \|w^{(t+\frac{1}{2})} - w^*\|^2 - \|w^{(t)} - w^*\|^2.
\end{aligned}
\]

Therefore, Lemma 14.1 holds when we add projection steps and hence the rest of
the analysis follows directly.


14.4.2 Variable Step Size


Another variant of SGD is decreasing the step size as a function of $t$. That is, rather
than updating with a constant $\eta$, we use $\eta_t$. For instance, we can set $\eta_t = \frac{B}{\rho \sqrt{t}}$ and
achieve a bound similar to Theorem 14.8. The idea is that when we are closer to the
minimum of the function, we take our steps more carefully, so as not to "overshoot"
the minimum.


14.4.3 Other Averaging Techniques 


We have set the output vector to be $\bar{w} = \frac{1}{T} \sum_{t=1}^{T} w^{(t)}$. There are alternative
approaches such as outputting $w^{(t)}$ for some random $t \in [T]$, or outputting the average
of $w^{(t)}$ over the last $\alpha T$ iterations, for some $\alpha \in (0, 1)$. One can also take
a weighted average of the last few iterates. These more sophisticated averaging
schemes can improve the convergence speed in some situations, such as in the case
of strongly convex functions defined in the following.
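
The output rules mentioned above differ only in how the stored iterates are combined. A minimal sketch follows, assuming the iterates $w^{(1)}, \ldots, w^{(T)}$ are kept in a list; the function names and the rounding rule for $\alpha T$ are illustrative choices, not taken from the text.

```python
import random


def average_all(iterates):
    """Plain average of all iterates w^(1), ..., w^(T)."""
    T = len(iterates)
    return [sum(col) / T for col in zip(*iterates)]


def average_last_fraction(iterates, alpha):
    """Average of the last alpha*T iterates (rounded down, at least one)."""
    k = max(1, int(alpha * len(iterates)))
    return average_all(iterates[-k:])


def random_iterate(iterates, rng):
    """A single iterate w^(t) for t drawn uniformly from [T]."""
    return list(rng.choice(iterates))


its = [[0.0], [2.0], [4.0], [6.0]]
full_avg = average_all(its)
tail_avg = average_last_fraction(its, 0.5)
picked = random_iterate(its, random.Random(0))
print(full_avg, tail_avg, picked)
```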


14.4.4 Strongly Convex Functions* 


In this section we show a variant of SGD that enjoys a faster convergence rate for 
problems in which the objective function is strongly convex (see Definition 13.4 of 
strong convexity in the previous chapter). We rely on the following claim, which 
generalizes Lemma 13.5. 


Claim 14.10. If $f$ is $\lambda$-strongly convex then for every $w, u$ and $v \in \partial f(w)$ we have

\[ \langle w - u, v \rangle \ge f(w) - f(u) + \frac{\lambda}{2} \|w - u\|^2. \]


The proof is similar to the proof of Lemma 13.5 and is left as an exercise. 
SGD for minimizing a $\lambda$-strongly convex function

Goal: Solve $\min_{w \in \mathcal{H}} f(w)$
parameter: $T$
initialize: $w^{(1)} = 0$
for $t = 1, \ldots, T$
    Choose a random vector $v_t$ s.t. $\mathbb{E}[v_t \mid w^{(t)}] \in \partial f(w^{(t)})$
    Set $\eta_t = 1/(\lambda t)$
    Set $w^{(t+\frac{1}{2})} = w^{(t)} - \eta_t v_t$
    Set $w^{(t+1)} = \operatorname{argmin}_{w \in \mathcal{H}} \|w - w^{(t+\frac{1}{2})}\|^2$
output: $\bar{w} = \frac{1}{T} \sum_{t=1}^{T} w^{(t)}$
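
A possible Python rendering of this box is sketched below, with the hypothesis set taken to be a Euclidean ball so that the projection is a rescaling. The test objective $f(w) = \frac{\lambda}{2}\|w - c\|^2$, the Gaussian noise model, the radius, and all names are assumptions introduced for the demonstration.

```python
import math
import random


def sgd_strongly_convex(noisy_subgrad, dim, lam, T, radius, rng):
    """SGD with step size 1/(lam*t) and projection onto the ball of the given radius."""
    w = [0.0] * dim
    total = [0.0] * dim
    for t in range(1, T + 1):
        total = [s + wi for s, wi in zip(total, w)]        # accumulate w^(t)
        v = noisy_subgrad(w, rng)
        eta_t = 1.0 / (lam * t)
        half = [wi - eta_t * vi for wi, vi in zip(w, v)]   # w^(t+1/2)
        norm = math.sqrt(sum(h * h for h in half))
        w = half if norm <= radius else [h * radius / norm for h in half]
    return [s / T for s in total]


LAM = 1.0
CENTER = [1.0, -2.0]  # minimizer of the assumed objective (lam/2) * ||w - CENTER||^2


def noisy_subgrad(w, rng):
    """Gradient lam * (w - CENTER) plus zero-mean Gaussian noise."""
    return [LAM * (wi - ci) + rng.gauss(0.0, 1.0) for wi, ci in zip(w, CENTER)]


w_bar = sgd_strongly_convex(noisy_subgrad, dim=2, lam=LAM, T=5000,
                            radius=5.0, rng=random.Random(0))
print("averaged output:", w_bar)  # expected to be near CENTER
```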


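Theorem 14.11. Assume that $f$ is $\lambda$-strongly convex and that $\mathbb{E}[\|v_t\|^2] \le \rho^2$. Let $w^* \in
\operatorname{argmin}_{w \in \mathcal{H}} f(w)$ be an optimal solution. Then,

\[ \mathbb{E}[f(\bar{w})] - f(w^*) \le \frac{\rho^2}{2 \lambda T} \bigl( 1 + \log(T) \bigr). \]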


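Proof. Let $\nabla^{(t)} = \mathbb{E}[v_t \mid w^{(t)}]$. Since $f$ is strongly convex and $\nabla^{(t)}$ is in the subgradient
set of $f$ at $w^{(t)}$ we have that

\[ \langle w^{(t)} - w^*, \nabla^{(t)} \rangle \ge f(w^{(t)}) - f(w^*) + \frac{\lambda}{2} \|w^{(t)} - w^*\|^2. \tag{14.11} \]

Next, we show that

\[ \langle w^{(t)} - w^*, \nabla^{(t)} \rangle \le \frac{\mathbb{E}\bigl[ \|w^{(t)} - w^*\|^2 - \|w^{(t+1)} - w^*\|^2 \bigr]}{2 \eta_t} + \frac{\eta_t}{2} \rho^2. \tag{14.12} \]

Since $w^{(t+1)}$ is the projection of $w^{(t+\frac{1}{2})}$ onto $\mathcal{H}$, and $w^* \in \mathcal{H}$ we have that
$\|w^{(t+\frac{1}{2})} - w^*\|^2 \ge \|w^{(t+1)} - w^*\|^2$. Therefore,

\[
\begin{aligned}
\|w^{(t)} - w^*\|^2 - \|w^{(t+1)} - w^*\|^2 &\ge \|w^{(t)} - w^*\|^2 - \|w^{(t+\frac{1}{2})} - w^*\|^2 \\
&= 2 \eta_t \langle w^{(t)} - w^*, v_t \rangle - \eta_t^2 \|v_t\|^2.
\end{aligned}
\]

Taking expectation of both sides, rearranging, and using the assumption $\mathbb{E}[\|v_t\|^2] \le
\rho^2$ yield Equation (14.12). Comparing Equation (14.11) and Equation (14.12) and
summing over $t$ we obtain

\[
\sum_{t=1}^{T} \bigl( \mathbb{E}[f(w^{(t)})] - f(w^*) \bigr)
\le \mathbb{E}\Bigl[ \sum_{t=1}^{T} \Bigl( \frac{\|w^{(t)} - w^*\|^2 - \|w^{(t+1)} - w^*\|^2}{2 \eta_t} - \frac{\lambda}{2} \|w^{(t)} - w^*\|^2 \Bigr) \Bigr] + \frac{\rho^2}{2} \sum_{t=1}^{T} \eta_t.
\]

Next, we use the definition $\eta_t = 1/(\lambda t)$ and note that the first sum on the right-hand
side of the equation collapses to $-\frac{\lambda T}{2} \|w^{(T+1)} - w^*\|^2 \le 0$. Thus,

\[ \sum_{t=1}^{T} \bigl( \mathbb{E}[f(w^{(t)})] - f(w^*) \bigr) \le \frac{\rho^2}{2 \lambda} \sum_{t=1}^{T} \frac{1}{t} \le \frac{\rho^2}{2 \lambda} \bigl( 1 + \log(T) \bigr). \]

The theorem follows from the preceding by dividing by $T$ and using Jensen's
inequality. □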


Remark 14.3. Rakhlin, Shamir, and Sridharan (2012) derived a convergence rate
in which the $\log(T)$ term is eliminated for a variant of the algorithm in which we
output the average of the last $T/2$ iterates, $\bar{w} = \frac{2}{T} \sum_{t=T/2+1}^{T} w^{(t)}$. Shamir and Zhang
(2013) have shown that Theorem 14.11 holds even if we output $\bar{w} = w^{(T)}$.


14.5 LEARNING WITH SGD 


We have so far introduced and analyzed the SGD algorithm for general convex 
functions. Now we shall consider its applicability to learning tasks. 


14.5.1 SGD for Risk Minimization


Recall that in learning we face the problem of minimizing the risk function

\[ L_{\mathcal{D}}(w) = \mathbb{E}_{z \sim \mathcal{D}} [\ell(w, z)]. \]

We have seen the method of empirical risk minimization, where we minimize the
empirical risk, $L_S(w)$, as an estimate to minimizing $L_{\mathcal{D}}(w)$. SGD allows us to take
a different approach and minimize $L_{\mathcal{D}}(w)$ directly. Since we do not know $\mathcal{D}$, we
cannot simply calculate $\nabla L_{\mathcal{D}}(w^{(t)})$ and minimize it with the GD method. With SGD,
however, all we need is to find an unbiased estimate of the gradient of $L_{\mathcal{D}}(w)$, that
is, a random vector whose conditional expected value is $\nabla L_{\mathcal{D}}(w^{(t)})$. We shall now
see how such an estimate can be easily constructed.

For simplicity, let us first consider the case of differentiable loss functions. Hence
the risk function $L_{\mathcal{D}}$ is also differentiable. The construction of the random vector $v_t$
will be as follows: First, sample $z \sim \mathcal{D}$. Then, define $v_t$ to be the gradient of the
function $\ell(w, z)$ with respect to $w$, at the point $w^{(t)}$. Then, by the linearity of the
gradient we have

\[ \mathbb{E}[v_t \mid w^{(t)}] = \mathbb{E}_{z \sim \mathcal{D}} [\nabla \ell(w^{(t)}, z)] = \nabla \mathbb{E}_{z \sim \mathcal{D}} [\ell(w^{(t)}, z)] = \nabla L_{\mathcal{D}}(w^{(t)}). \tag{14.13} \]

The gradient of the loss function $\ell(w, z)$ at $w^{(t)}$ is therefore an unbiased estimate of
the gradient of the risk function $L_{\mathcal{D}}(w^{(t)})$ and is easily constructed by sampling a
single fresh example $z \sim \mathcal{D}$ at each iteration $t$.

The same argument holds for nondifferentiable loss functions. We simply let $v_t$
be a subgradient of $\ell(w, z)$ at $w^{(t)}$. Then, for every $u$ we have

\[ \ell(u, z) - \ell(w^{(t)}, z) \ge \langle u - w^{(t)}, v_t \rangle. \]

Taking expectation on both sides with respect to $z \sim \mathcal{D}$ and conditioned on the value
of $w^{(t)}$ we obtain

\[
\begin{aligned}
L_{\mathcal{D}}(u) - L_{\mathcal{D}}(w^{(t)}) &= \mathbb{E}[\ell(u, z) - \ell(w^{(t)}, z) \mid w^{(t)}] \\
&\ge \mathbb{E}[\langle u - w^{(t)}, v_t \rangle \mid w^{(t)}] \\
&= \langle u - w^{(t)}, \mathbb{E}[v_t \mid w^{(t)}] \rangle.
\end{aligned}
\]

It follows that $\mathbb{E}[v_t \mid w^{(t)}]$ is a subgradient of $L_{\mathcal{D}}(w)$ at $w^{(t)}$.
To summarize, the stochastic gradient descent framework for minimizing the
risk is as follows.


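Stochastic Gradient Descent (SGD) for minimizing $L_{\mathcal{D}}(w)$

parameters: Scalar $\eta > 0$, integer $T > 0$
initialize: $w^{(1)} = 0$
for $t = 1, 2, \ldots, T$
    sample $z \sim \mathcal{D}$
    pick $v_t \in \partial \ell(w^{(t)}, z)$
    update $w^{(t+1)} = w^{(t)} - \eta v_t$
output $\bar{w} = \frac{1}{T} \sum_{t=1}^{T} w^{(t)}$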
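
To make the framework concrete, the sketch below runs it with the hinge loss on examples drawn one at a time from a synthetic distribution. The distribution (points in the square whose label is the sign of the first coordinate, with a small gap around zero), the step size, and the number of iterations are all hypothetical choices for illustration; each iteration uses one fresh example, so the number of iterations equals the number of examples consumed.

```python
import random


def sample_example(rng):
    """Draw z = (x, y) from an assumed distribution D: label is the sign of x[0]."""
    while True:
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        if abs(x[0]) > 0.1:                      # keep a margin around the boundary
            return x, (1.0 if x[0] > 0 else -1.0)


def sgd_hinge(T, eta, rng):
    """SGD for minimizing the expected hinge loss; returns the averaged vector."""
    w = [0.0, 0.0]
    total = [0.0, 0.0]
    for _ in range(T):
        total = [s + wi for s, wi in zip(total, w)]
        x, y = sample_example(rng)               # fresh z ~ D
        if 1.0 - y * (w[0] * x[0] + w[1] * x[1]) > 0.0:
            v = [-y * xi for xi in x]            # subgradient of the hinge loss at w
        else:
            v = [0.0, 0.0]
        w = [wi - eta * vi for wi, vi in zip(w, v)]
    return [s / T for s in total]


w_bar = sgd_hinge(T=2000, eta=0.1, rng=random.Random(0))
print("learned predictor:", w_bar)
```

The resulting predictor can be evaluated on further fresh samples from the same distribution, which estimates the risk that SGD is minimizing directly.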


We shall now use our analysis of SGD to obtain a sample complexity analysis for
learning convex-Lipschitz-bounded problems. Theorem 14.8 yields the following:


Corollary 14.12. Consider a convex-Lipschitz-bounded learning problem with
parameters $\rho, B$. Then, for every $\epsilon > 0$, if we run the SGD method for minimizing
$L_{\mathcal{D}}(w)$ with a number of iterations (i.e., number of examples)

\[ T \ge \frac{B^2 \rho^2}{\epsilon^2} \]

and with $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$, then the output of SGD satisfies

\[ \mathbb{E}[L_{\mathcal{D}}(\bar{w})] \le \min_{w \in \mathcal{H}} L_{\mathcal{D}}(w) + \epsilon. \]

It is interesting to note that the required sample complexity is of the same order
of magnitude as the sample complexity guarantee we derived for regularized loss
minimization. In fact, the sample complexity of SGD is even better than what we
have derived for regularized loss minimization by a factor of 8.


14.5.2 Analyzing SGD for Convex-Smooth Learning Problems


In the previous chapter we saw that the regularized loss minimization rule also 
learns the class of convex-smooth-bounded learning problems. We now show that 
the SGD algorithm can be also used for such problems. 


Theorem 14.13. Assume that for all $z$, the loss function $\ell(\cdot, z)$ is convex, $\beta$-smooth,
and nonnegative. Then, if we run the SGD algorithm for minimizing $L_{\mathcal{D}}(w)$ we have
that for every $w^*$,

\[ \mathbb{E}[L_{\mathcal{D}}(\bar{w})] \le \frac{1}{1 - \eta \beta} \Bigl( L_{\mathcal{D}}(w^*) + \frac{\|w^*\|^2}{2 \eta T} \Bigr). \]

Proof. Recall that if a function is $\beta$-smooth and nonnegative then it is self-bounded:

\[ \|\nabla f(w)\|^2 \le 2 \beta f(w). \]




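To analyze SGD for convex-smooth problems, let us define $z_1, \ldots, z_T$ the random
samples of the SGD algorithm, let $f_t(\cdot) = \ell(\cdot, z_t)$, and note that $v_t = \nabla f_t(w^{(t)})$.
For all $t$, $f_t$ is a convex function and therefore $f_t(w^{(t)}) - f_t(w^*) \le \langle v_t, w^{(t)} - w^* \rangle$.
Summing over $t$ and using Lemma 14.1 we obtain

\[ \sum_{t=1}^{T} \bigl( f_t(w^{(t)}) - f_t(w^*) \bigr) \le \sum_{t=1}^{T} \langle v_t, w^{(t)} - w^* \rangle \le \frac{\|w^*\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2. \]

Combining the preceding with the self-boundedness of $f_t$ yields

\[ \sum_{t=1}^{T} \bigl( f_t(w^{(t)}) - f_t(w^*) \bigr) \le \frac{\|w^*\|^2}{2\eta} + \eta \beta \sum_{t=1}^{T} f_t(w^{(t)}). \]

Dividing by $T$ and rearranging, we obtain

\[ \frac{1}{T} \sum_{t=1}^{T} f_t(w^{(t)}) \le \frac{1}{1 - \eta \beta} \Bigl( \frac{1}{T} \sum_{t=1}^{T} f_t(w^*) + \frac{\|w^*\|^2}{2 \eta T} \Bigr). \]

Next, we take expectation of the two sides of the preceding equation with respect to
$z_1, \ldots, z_T$. Clearly, $\mathbb{E}[f_t(w^*)] = L_{\mathcal{D}}(w^*)$. In addition, using the same argument as in
the proof of Theorem 14.8 we have that

\[ \mathbb{E}\Bigl[ \frac{1}{T} \sum_{t=1}^{T} f_t(w^{(t)}) \Bigr] = \mathbb{E}\Bigl[ \frac{1}{T} \sum_{t=1}^{T} L_{\mathcal{D}}(w^{(t)}) \Bigr] \ge \mathbb{E}[L_{\mathcal{D}}(\bar{w})]. \]

Combining all we conclude our proof. □
As a direct corollary we obtain: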
As a direct corollary we obtain: 


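Corollary 14.14. Consider a convex-smooth-bounded learning problem with parameters
$\beta, B$. Assume in addition that $\ell(0, z) \le 1$ for all $z \in Z$. For every $\epsilon > 0$, set
$\eta = \frac{1}{\beta(1 + 3/\epsilon)}$. Then, running SGD with $T \ge 12 B^2 \beta / \epsilon^2$ yields

\[ \mathbb{E}[L_{\mathcal{D}}(\bar{w})] \le \min_{w \in \mathcal{H}} L_{\mathcal{D}}(w) + \epsilon. \]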


14.5.3 SGD for Regularized Loss Minimization 


We have shown that SGD enjoys the same worst-case sample complexity bound 
as regularized loss minimization. However, on some distributions, regularized loss 
minimization may yield a better solution. Therefore, in some cases we may want 
to solve the optimization problem associated with regularized loss minimization, 
namely, 


\[ \min_{w} \Bigl( \frac{\lambda}{2} \|w\|^2 + L_S(w) \Bigr).^{1} \tag{14.14} \]


Since we are dealing with convex learning problems in which the loss function is
convex, the preceding problem is also a convex optimization problem that can be
solved using SGD as well, as we shall see in this section.


$^{1}$ We divided $\lambda$ by 2 for convenience.


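Define $f(w) = \frac{\lambda}{2} \|w\|^2 + L_S(w)$. Note that $f$ is a $\lambda$-strongly convex function;
therefore, we can apply the SGD variant given in Section 14.4.4 (with $\mathcal{H} = \mathbb{R}^d$).
To apply this algorithm, we only need to find a way to construct an unbiased estimate
of a subgradient of $f$ at $w^{(t)}$. This is easily done by noting that if we pick $z$
uniformly at random from $S$, and choose $v_t \in \partial \ell(w^{(t)}, z)$ then the expected value of
$\lambda w^{(t)} + v_t$ is a subgradient of $f$ at $w^{(t)}$.

To analyze the resulting algorithm, we first rewrite the update rule (assuming
that $\mathcal{H} = \mathbb{R}^d$ and therefore the projection step does not matter) as follows

\[
\begin{aligned}
w^{(t+1)} &= w^{(t)} - \frac{1}{\lambda t} \bigl( \lambda w^{(t)} + v_t \bigr) \\
&= \Bigl( 1 - \frac{1}{t} \Bigr) w^{(t)} - \frac{1}{\lambda t} v_t \\
&= \frac{t-1}{t} w^{(t)} - \frac{1}{\lambda t} v_t \\
&= \frac{t-1}{t} \Bigl( \frac{t-2}{t-1} w^{(t-1)} - \frac{1}{\lambda (t-1)} v_{t-1} \Bigr) - \frac{1}{\lambda t} v_t \\
&= -\frac{1}{\lambda t} \sum_{i=1}^{t} v_i.
\end{aligned}
\tag{14.15}
\]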
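
The unrolled form in Equation (14.15) is a purely algebraic identity, so it can be checked on any fixed sequence of vectors. The sketch below does this in one dimension for brevity; in the actual algorithm each $v_t$ depends on $w^{(t)}$, whereas here the sequence and the value of $\lambda$ are arbitrary numbers chosen for the check.

```python
def iterative_updates(vs, lam):
    """Apply w <- w - (1/(lam*t)) * (lam*w + v_t) from w = 0; return w^(2), ..., w^(T+1)."""
    w = 0.0
    out = []
    for t, v in enumerate(vs, start=1):
        w = w - (1.0 / (lam * t)) * (lam * w + v)
        out.append(w)
    return out


def closed_form(vs, lam):
    """Closed form of Equation (14.15): w^(t+1) = -(1/(lam*t)) * (v_1 + ... + v_t)."""
    out = []
    running_sum = 0.0
    for t, v in enumerate(vs, start=1):
        running_sum += v
        out.append(-running_sum / (lam * t))
    return out


vs = [0.5, -1.0, 2.0, 0.25, -0.75]
a = iterative_updates(vs, lam=0.1)
b = closed_form(vs, lam=0.1)
max_gap = max(abs(x - y) for x, y in zip(a, b))
print("largest difference between the two forms:", max_gap)
```

One practical reading of the identity is that the iterate is determined by the running sum of the sampled subgradients, so an implementation may store that sum instead of applying the shrink-and-step update.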


If we assume that the loss function is $\rho$-Lipschitz, it follows that for all $t$ we have
$\|v_t\| \le \rho$ and therefore $\|\lambda w^{(t)}\| \le \rho$, which yields

\[ \|\lambda w^{(t)} + v_t\| \le 2\rho. \]

Theorem 14.11 therefore tells us that after performing $T$ iterations we have that

\[ \mathbb{E}[f(\bar{w})] - f(w^*) \le \frac{4 \rho^2}{\lambda T} \bigl( 1 + \log(T) \bigr). \]


14.6 SUMMARY 


We have introduced the Gradient Descent and Stochastic Gradient Descent algo- 
rithms, along with several of their variants. We have analyzed their convergence rate 
and calculated the number of iterations that would guarantee an expected objective 
of at most $\epsilon$ plus the optimal objective. Most importantly, we have shown that by
using SGD we can directly minimize the risk function. We do so by sampling a
point i.i.d. from $\mathcal{D}$ and using a subgradient of the loss of the current hypothesis $w^{(t)}$
at this point as an unbiased estimate of the gradient (or a subgradient) of the risk
function. This implies that a bound on the number of iterations also yields a sam- 
ple complexity bound. Finally, we have also shown how to apply the SGD method 
to the problem of regularized risk minimization. In future chapters we show how 
this yields extremely simple solvers to some optimization problems associated with 
regularized risk minimization. 
14.7 BIBLIOGRAPHIC REMARKS


SGD dates back to Robbins and Monro (1951). It is especially effective in large scale 
machine learning problems. See, for example, (Murata 1998, Le Cun 2004, Zhang 
2004, Bottou & Bousquet 2008, Shalev-Shwartz, Singer & Srebro 2007, Shalev- 
Shwartz & Srebro 2008). In the optimization community it was studied in the context 
of stochastic optimization. See, for example, (Nemirovski & Yudin 1978, Nesterov & 
Nesterov 2004, Nesterov 2005, Nemirovski, Juditsky, Lan & Shapiro 2009, Shapiro, 
Dentcheva & Ruszczyński 2009).

The bound we have derived for strongly convex functions is due to Hazan,
Agarwal, and Kale (2007). As mentioned previously, improved bounds have been 
obtained in Rakhlin, Shamir & Sridharan (2012). 


14.8 EXERCISES 


14.1 Prove Claim 14.10. Hint: Extend the proof of Lemma 13.5. 

14.2 Prove Corollary 14.14. 

14.3 Perceptron as a subgradient descent algorithm: Let $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in
(\mathbb{R}^d \times \{\pm 1\})^m$. Assume that there exists $w \in \mathbb{R}^d$ such that for every $i \in [m]$ we have
$y_i \langle w, x_i \rangle \ge 1$, and let $w^*$ be a vector that has the minimal norm among all vectors
that satisfy the preceding requirement. Let $R = \max_i \|x_i\|$. Define a function

\[ f(w) = \max_{i \in [m]} \bigl( 1 - y_i \langle w, x_i \rangle \bigr). \]

• Show that $\min_{w : \|w\| \le \|w^*\|} f(w) = 0$ and show that any $w$ for which $f(w) < 1$
separates the examples in $S$.

• Show how to calculate a subgradient of $f$.

• Describe and analyze the subgradient descent algorithm for this case. Compare
the algorithm and the analysis to the Batch Perceptron algorithm given in
Section 9.1.2.

14.4 Variable step size (*): Prove an analog of Theorem 14.8 for SGD with a variable
step size, $\eta_t = \frac{B}{\rho \sqrt{t}}$.
Support Vector Machines


In this chapter and the next we discuss a very useful machine learning tool: the 
support vector machine paradigm (SVM) for learning linear predictors in high 
dimensional feature spaces. The high dimensionality of the feature space raises both 
sample complexity and computational complexity challenges. 

The SVM algorithmic paradigm tackles the sample complexity challenge by 
searching for “large margin” separators. Roughly speaking, a halfspace separates 
a training set with a large margin if all the examples are not only on the correct 
side of the separating hyperplane but also far away from it. Restricting the algo- 
rithm to output a large margin separator can yield a small sample complexity even 
if the dimensionality of the feature space is high (and even infinite). We introduce 
the concept of margin and relate it to the regularized loss minimization paradigm as 
well as to the convergence rate of the Perceptron algorithm. 

In the next chapter we will tackle the computational complexity challenge using 
the idea of kernels. 


15.1 MARGIN AND HARD-SVM 


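Let $S = (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)$ be a training set of examples, where each $\mathbf{x}_i \in \mathbb{R}^d$ and 
$y_i \in \{\pm 1\}$. We say that this training set is linearly separable if there exists a halfspace, 
$(\mathbf{w}, b)$, such that $y_i = \mathrm{sign}(\langle \mathbf{w}, \mathbf{x}_i \rangle + b)$ for all $i$. Alternatively, this condition can be 
rewritten as 


$$\forall i \in [m], \quad y_i (\langle \mathbf{w}, \mathbf{x}_i \rangle + b) > 0.$$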


All halfspaces (w,b) that satisfy this condition are ERM hypotheses (their 0-1 
error is zero, which is the minimum possible error). For any separable training 
sample, there are many ERM halfspaces. Which one of them should the learner 
pick? 

Consider, for example, the training set described in the picture that 
follows. 

While both the dashed and solid hyperplanes separate the four examples, our 
intuition would probably lead us to prefer the dashed hyperplane over the solid one. 
One way to formalize this intuition is using the concept of margin. 

The margin of a hyperplane with respect to a training set is defined to be the 
minimal distance between a point in the training set and the hyperplane. If a hyper- 
plane has a large margin, then it will still separate the training set even if we slightly 
perturb each instance. 

We will see later on that the true error of a halfspace can be bounded in terms 
of the margin it has over the training sample (the larger the margin, the smaller the 
error), regardless of the Euclidean dimension in which this halfspace resides. 

Hard-SVM is the learning rule in which we return an ERM hyperplane that 
separates the training set with the largest possible margin. To define Hard-SVM 
formally, we first express the distance between a point $\mathbf{x}$ and a hyperplane using the 
parameters defining the halfspace. 


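Claim 15.1. The distance between a point $\mathbf{x}$ and the hyperplane defined by $(\mathbf{w}, b)$ 
where $\|\mathbf{w}\| = 1$ is $|\langle \mathbf{w}, \mathbf{x} \rangle + b|$. 


Proof. The distance between a point $\mathbf{x}$ and the hyperplane is defined as 

$$\min \{ \|\mathbf{x} - \mathbf{v}\| : \langle \mathbf{w}, \mathbf{v} \rangle + b = 0 \}.$$

Taking $\mathbf{v} = \mathbf{x} - (\langle \mathbf{w}, \mathbf{x} \rangle + b) \mathbf{w}$ we have that 

$$\langle \mathbf{w}, \mathbf{v} \rangle + b = \langle \mathbf{w}, \mathbf{x} \rangle - (\langle \mathbf{w}, \mathbf{x} \rangle + b) \|\mathbf{w}\|^2 + b = 0,$$

and 

$$\|\mathbf{x} - \mathbf{v}\| = |\langle \mathbf{w}, \mathbf{x} \rangle + b| \, \|\mathbf{w}\| = |\langle \mathbf{w}, \mathbf{x} \rangle + b|.$$

Hence, the distance is at most $|\langle \mathbf{w}, \mathbf{x} \rangle + b|$. Next, take any other point $\mathbf{u}$ on the 
hyperplane, thus $\langle \mathbf{w}, \mathbf{u} \rangle + b = 0$. We have 

$$\begin{aligned}
\|\mathbf{x} - \mathbf{u}\|^2 &= \|\mathbf{x} - \mathbf{v} + \mathbf{v} - \mathbf{u}\|^2 \\
&= \|\mathbf{x} - \mathbf{v}\|^2 + \|\mathbf{v} - \mathbf{u}\|^2 + 2 \langle \mathbf{x} - \mathbf{v}, \mathbf{v} - \mathbf{u} \rangle \\
&\ge \|\mathbf{x} - \mathbf{v}\|^2 + 2 \langle \mathbf{x} - \mathbf{v}, \mathbf{v} - \mathbf{u} \rangle \\
&= \|\mathbf{x} - \mathbf{v}\|^2 + 2 (\langle \mathbf{w}, \mathbf{x} \rangle + b) \langle \mathbf{w}, \mathbf{v} - \mathbf{u} \rangle \\
&= \|\mathbf{x} - \mathbf{v}\|^2,
\end{aligned}$$

where the last equality is because $\langle \mathbf{w}, \mathbf{v} \rangle = \langle \mathbf{w}, \mathbf{u} \rangle = -b$. Hence, the distance between 
$\mathbf{x}$ and $\mathbf{u}$ is at least the distance between $\mathbf{x}$ and $\mathbf{v}$, which concludes our proof. □ 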
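
The construction in the proof can be checked numerically. The following is a minimal sketch, assuming an arbitrary unit-norm $\mathbf{w}$, bias $b$, and point $\mathbf{x}$ that are not taken from the text; the helper names (`dot`, `norm`, `project`, `distance_to_hyperplane`) are illustrative only. It builds the point $\mathbf{v}$ from the proof and compares it with other points sampled on the same hyperplane.

```python
import math
import random


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def distance_to_hyperplane(w, b, x):
    """Closed form from Claim 15.1; only valid when ||w|| = 1."""
    return abs(dot(w, x) + b)


def project(w, b, x):
    """The point v = x - (<w, x> + b) w used in the proof."""
    s = dot(w, x) + b
    return [xi - s * wi for xi, wi in zip(x, w)]


# Illustrative values (assumptions, not from the text); w has unit norm.
w, b = [0.6, 0.8], -1.0
x = [3.0, 4.0]

v = project(w, b, x)
d = distance_to_hyperplane(w, b, x)
gap = norm([xi - vi for xi, vi in zip(x, v)])

# Other points on the hyperplane: move from v along a direction orthogonal to w.
rng = random.Random(0)
direction = [-0.8, 0.6]
others = []
for _ in range(100):
    t = rng.uniform(-5.0, 5.0)
    others.append([v[0] + t * direction[0], v[1] + t * direction[1]])
closest_other = min(norm([xi - ui for xi, ui in zip(x, u)]) for u in others)
```

Here `v` lies on the hyperplane, `gap` agrees with the closed form `d`, and none of the sampled points is closer to `x` than `v` is, which mirrors the two halves of the proof.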
15.1.1 The Homogenous Case 


It is often more convenient to consider homogenous halfspaces, namely, halfspaces 
that pass through the origin and are thus defined by $\mathrm{sign}(\langle \mathbf{w}, \mathbf{x} \rangle)$, where the bias term 
$b$ is set to be zero. Hard-SVM for homogenous halfspaces amounts to solving 


$$\min_{\mathbf{w}} \|\mathbf{w}\|^2 \quad \text{s.t.} \quad \forall i, \; y_i \langle \mathbf{w}, \mathbf{x}_i \rangle \ge 1. \tag{15.3}$$


As we discussed in Chapter 9, we can reduce the problem of learning 
nonhomogenous halfspaces to the problem of learning homogenous halfspaces by 
adding one more feature to each instance $\mathbf{x}_i$, thus increasing the dimension to 
$d + 1$. 

Note, however, that the optimization problem given in Equation (15.2) does not 
regularize the bias term $b$, while if we learn a homogenous halfspace in $\mathbb{R}^{d+1}$ using 
Equation (15.3) then we regularize the bias term (i.e., the $(d+1)$-th component of the 
weight vector) as well. However, regularizing $b$ usually does not make a significant 
difference to the sample complexity. 
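
The reduction to the homogenous case can be illustrated in a few lines. This is a sketch under the assumption that the extra feature is the constant 1 appended as the last coordinate; the vectors and the helper name `augment` are hypothetical.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def augment(x):
    """Append a constant feature 1 so the bias becomes an ordinary weight."""
    return list(x) + [1.0]


# Illustrative nonhomogenous halfspace (w, b) in R^2 (values are assumptions).
w, b = [2.0, -1.0], 0.5
w_aug = w + [b]          # the corresponding homogenous halfspace in R^3

x = [1.5, 3.0]
nonhomogenous_score = dot(w, x) + b
homogenous_score = dot(w_aug, augment(x))

# The squared norm of the augmented vector includes b^2,
# so minimizing it penalizes the bias as well.
extra_penalty = dot(w_aug, w_aug) - dot(w, w)
```

Both scores coincide, and `extra_penalty` equals $b^2$, which is exactly the additional regularization of the bias term described above.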


15.1.2 The Sample Complexity of Hard-SVM 


Recall that the VC-dimension of halfspaces in $\mathbb{R}^d$ is $d + 1$. It follows that the sample 
complexity of learning halfspaces grows with the dimensionality of the problem. 
Furthermore, the fundamental theorem of learning tells us that if the number of 
examples is significantly smaller than $d/\epsilon$ then no algorithm can learn an $\epsilon$-accurate 
halfspace. This is problematic when $d$ is very large. 

To overcome this problem, we will make an additional assumption on the 
underlying data distribution. In particular, we will define a “separability with margin $\gamma$” 
assumption and will show that if the data is separable with margin $\gamma$ then the 
sample complexity is bounded from above by a function of $1/\gamma^2$. It follows that even if 
the dimensionality is very large (or even infinite), as long as the data adheres to the 
separability with margin assumption we can still have a small sample complexity. 
There is no contradiction to the lower bound given in the fundamental theorem of 
learning because we are now making an additional assumption on the underlying 
data distribution. 

Before we formally define the separability with margin assumption, there is a 
scaling issue we need to resolve. Suppose that a training set $S = (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)$ 
is separable with a margin $\gamma$; namely, the maximal objective value of Equation (15.1) 
is at least $\gamma$. Then, for any positive scalar $\alpha > 0$, the training set 
$S' = (\alpha \mathbf{x}_1, y_1), \ldots, (\alpha \mathbf{x}_m, y_m)$ is separable with a margin of $\alpha \gamma$. That is, a simple scaling 
of the data can make it separable with an arbitrarily large margin. It follows that in 
order to give a meaningful definition of margin we must take into account the scale 
of the examples as well. One way to formalize this is using the definition that follows. 
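
The scaling issue is easy to see on a toy example. The sketch below assumes a small hand-made separable set and a fixed unit-norm halfspace (both hypothetical), and measures the margin of that halfspace before and after multiplying every instance by $\alpha$; the bias is scaled by $\alpha$ as well so that the same hyperplane direction is compared.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def margin(w, b, S):
    """Smallest value of y(<w, x> + b) over S; a distance only if ||w|| = 1."""
    return min(y * (dot(w, x) + b) for x, y in S)


def max_norm(S):
    """Largest Euclidean norm among the instances of S."""
    return max(math.sqrt(dot(x, x)) for x, _ in S)


# Toy separable data and a unit-norm halfspace (assumed for illustration).
S = [([1.0, 1.0], 1), ([2.0, 0.5], 1), ([-1.0, -1.0], -1), ([-0.5, -2.0], -1)]
w = [1 / math.sqrt(2), 1 / math.sqrt(2)]
b = 0.0

alpha = 10.0
S_scaled = [([alpha * xi for xi in x], y) for x, y in S]

gamma = margin(w, b, S)
gamma_scaled = margin(w, alpha * b, S_scaled)
```

The margin grows by the factor `alpha`, while the ratio between the margin and the largest instance norm stays the same, which is why a meaningful notion of margin has to account for the scale of the examples.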


Definition 15.3. Let $\mathcal{D}$ be a distribution over $\mathbb{R}^d \times \{\pm 1\}$. We say that $\mathcal{D}$ is separable 
with a $(\gamma, \rho)$-margin if there exists $(\mathbf{w}^*, b^*)$ such that $\|\mathbf{w}^*\| = 1$ and such that with 
probability 1 over the choice of $(\mathbf{x}, y) \sim \mathcal{D}$ we have that $y(\langle \mathbf{w}^*, \mathbf{x} \rangle + b^*) \ge \gamma$ and 
$\|\mathbf{x}\| \le \rho$. Similarly, we say that $\mathcal{D}$ is separable with a $(\gamma, \rho)$-margin using a homogenous 
halfspace if the preceding holds with a halfspace of the form $(\mathbf{w}^*, 0)$. 

We can rewrite Equation (15.4) as a regularized loss minimization problem. 
Recall the definition of the hinge loss: 


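$$\ell^{\mathrm{hinge}}((\mathbf{w}, b), (\mathbf{x}, y)) = \max\{0, 1 - y(\langle \mathbf{w}, \mathbf{x} \rangle + b)\}.$$


Given $(\mathbf{w}, b)$ and a training set $S$, the averaged hinge loss on $S$ is denoted by 
$L_S^{\mathrm{hinge}}((\mathbf{w}, b))$. Now, consider the regularized loss minimization problem: 


$$\min_{\mathbf{w}, b} \left( \lambda \|\mathbf{w}\|^2 + L_S^{\mathrm{hinge}}((\mathbf{w}, b)) \right). \tag{15.5}$$


Claim 15.5. Equation (15.4) and Equation (15.5) are equivalent. 


Proof. Fix some $\mathbf{w}, b$ and consider the minimization over $\boldsymbol{\xi}$ in Equation (15.4). 
Fix some $i$. Since $\xi_i$ must be nonnegative, the best assignment to $\xi_i$ would be 0 
if $y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \ge 1$ and would be $1 - y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b)$ otherwise. In other words, 
$\xi_i = \ell^{\mathrm{hinge}}((\mathbf{w}, b), (\mathbf{x}_i, y_i))$ for all $i$, and the claim follows. □ 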


We therefore see that Soft-SVM falls into the paradigm of regularized loss min- 
imization that we studied in the previous chapter. A Soft-SVM algorithm, that is, a 
solution for Equation (15.5), has a bias toward low norm separators. The objective 
function that we aim to minimize in Equation (15.5) penalizes not only for training 
errors but also for large norm. 
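
The following sketch evaluates the regularized hinge objective of Equation (15.5) on a toy sample and checks the argument of Claim 15.5. Equation (15.4) is not reproduced in this part of the text, so the code assumes it is the usual slack formulation with constraints $y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$; the data, the halfspace, and all function names are illustrative.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def hinge(w, b, x, y):
    """Hinge loss max{0, 1 - y(<w, x> + b)}."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))


def best_slack(w, b, x, y):
    """Smallest xi >= 0 with y(<w, x> + b) >= 1 - xi (assumed form of (15.4))."""
    m = y * (dot(w, x) + b)
    return 0.0 if m >= 1.0 else 1.0 - m


def regularized_hinge_objective(w, b, S, lam):
    """Objective of Equation (15.5): lam * ||w||^2 + average hinge loss."""
    avg = sum(hinge(w, b, x, y) for x, y in S) / len(S)
    return lam * dot(w, w) + avg


# Toy sample; the last example is misclassified by the chosen halfspace.
S = [([2.0, 0.0], 1), ([0.5, 0.0], 1), ([-1.0, 0.0], -1), ([0.25, 0.0], -1)]
w, b, lam = [1.0, 0.0], 0.0, 0.1

slacks = [best_slack(w, b, x, y) for x, y in S]
losses = [hinge(w, b, x, y) for x, y in S]
objective = regularized_hinge_objective(w, b, S, lam)
```

For every example the optimal slack coincides with the hinge loss, so plugging the optimal slacks into the constrained problem gives the same value as the unconstrained objective.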

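It is often more convenient to consider Soft-SVM for learning a homogenous 
halfspace, where the bias term $b$ is set to be zero, which yields the following 
optimization problem: 


$$\min_{\mathbf{w}} \left( \lambda \|\mathbf{w}\|^2 + L_S^{\mathrm{hinge}}(\mathbf{w}) \right), \tag{15.6}$$

where 

$$L_S^{\mathrm{hinge}}(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^m \max\{0, 1 - y_i \langle \mathbf{w}, \mathbf{x}_i \rangle\}.$$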


15.2.1 The Sample Complexity of Soft-SVM 


We now analyze the sample complexity of Soft-SVM for the case of homogenous 
halfspaces (namely, the output of Equation (15.6)). In Corollary 13.8 we derived 
a generalization bound for the regularized loss minimization framework assuming 
that the loss function is convex and Lipschitz. We have already shown that the hinge 
loss is convex so it is only left to analyze the Lipschitzness of the hinge loss. 


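Claim 15.6. Let $f(\mathbf{w}) = \max\{0, 1 - y \langle \mathbf{w}, \mathbf{x} \rangle\}$. Then, $f$ is $\|\mathbf{x}\|$-Lipschitz. 


Proof. It is easy to verify that any subgradient of $f$ at $\mathbf{w}$ is of the form $\alpha \mathbf{x}$ where 
$|\alpha| \le 1$. The claim now follows from Lemma 14.7. □ 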


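Corollary 13.8 therefore yields the following: 


Corollary 15.7. Let $\mathcal{D}$ be a distribution over $\mathcal{X} \times \{\pm 1\}$, where $\mathcal{X} = \{\mathbf{x} : \|\mathbf{x}\| \le \rho\}$. 
Consider running Soft-SVM (Equation (15.6)) on a training set $S \sim \mathcal{D}^m$ and let $A(S)$ 
be the solution of Soft-SVM. Then, for every $\mathbf{u}$, 


$$\mathbb{E}_{S \sim \mathcal{D}^m} \left[ L_{\mathcal{D}}^{\mathrm{hinge}}(A(S)) \right] \le L_{\mathcal{D}}^{\mathrm{hinge}}(\mathbf{u}) + \lambda \|\mathbf{u}\|^2 + \frac{2 \rho^2}{\lambda m}.$$

Furthermore, since the hinge loss upper bounds the 0-1 loss we also have 


$$\mathbb{E}_{S \sim \mathcal{D}^m} \left[ L_{\mathcal{D}}^{0-1}(A(S)) \right] \le L_{\mathcal{D}}^{\mathrm{hinge}}(\mathbf{u}) + \lambda \|\mathbf{u}\|^2 + \frac{2 \rho^2}{\lambda m}.$$


Last, for every $B > 0$, if we set $\lambda = \sqrt{\frac{2 \rho^2}{B^2 m}}$ then 


$$\mathbb{E}_{S \sim \mathcal{D}^m} \left[ L_{\mathcal{D}}^{0-1}(A(S)) \right] \le \mathbb{E}_{S \sim \mathcal{D}^m} \left[ L_{\mathcal{D}}^{\mathrm{hinge}}(A(S)) \right] \le \min_{\mathbf{w} : \|\mathbf{w}\| \le B} L_{\mathcal{D}}^{\mathrm{hinge}}(\mathbf{w}) + \sqrt{\frac{8 \rho^2 B^2}{m}}.$$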


We therefore see that we can control the sample complexity of learning a half- 
space as a function of the norm of that halfspace, independently of the Euclidean 
dimension of the space over which the halfspace is defined. This becomes highly 
significant when we learn via embeddings into high dimensional feature spaces, as 
we will consider in the next chapter. 


Remark 15.2. The condition that $\mathcal{X}$ will contain vectors with a bounded norm 
follows from the requirement that the loss function will be Lipschitz. This is not just 
a technicality. As we discussed before, separation with large margin is meaningless 
without imposing a restriction on the scale of the instances. Indeed, without a con- 
straint on the scale, we can always enlarge the margin by multiplying all instances 
by a large scalar. 


15.2.2 Margin and Norm-Based Bounds versus Dimension 


The bounds we have derived for Hard-SVM and Soft-SVM do not depend on the 
dimension of the instance space. Instead, the bounds depend on the norm of the 
examples, $\rho$, the norm of the halfspace $B$ (or equivalently the margin parameter 
$\gamma$) and, in the nonseparable case, the bounds also depend on the minimum hinge 
loss of all halfspaces of norm $\le B$. In contrast, the VC-dimension of the class of 
homogenous halfspaces is $d$, which implies that the error of an ERM hypothesis 
decreases as $\sqrt{d/m}$ does. We now give an example in which $\rho^2 B^2 \ll d$; hence the 
bound given in Corollary 15.7 is much better than the VC bound. 

Consider the problem of learning to classify a short text document according 
to its topic, say, whether the document is about sports or not. We first need to 
represent documents as vectors. One simple yet effective way is to use a bag-of-words 
representation. That is, we define a dictionary of words and set the dimension 
$d$ to be the number of words in the dictionary. Given a document, we represent it 
as a vector $\mathbf{x} \in \{0, 1\}^d$, where $x_i = 1$ if the $i$'th word in the dictionary appears in the 
document and $x_i = 0$ otherwise. Therefore, for this problem, the value of $\rho^2$ will be 
the maximal number of distinct words in a given document. 

A halfspace for this problem assigns weights to words. It is natural to assume 
that by assigning positive and negative weights to a few dozen words we will be 
able to determine whether a given document is about sports or not with reasonable 
accuracy. Therefore, for this problem, the value of $B^2$ can be set to be less than 100. 
Overall, it is reasonable to say that the value of $B^2 \rho^2$ is smaller than 10,000. 

On the other hand, a typical size of a dictionary is much larger than 10,000. For 
example, there are more than 100,000 distinct words in English. We have therefore 
shown a problem in which there can be an order of magnitude difference between 
learning a halfspace with the SVM rule and learning a halfspace using the vanilla 
ERM rule. 
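
To make the comparison concrete, the following sketch plugs illustrative numbers into the two estimation terms, $\sqrt{\rho^2 B^2 / m}$ and $\sqrt{d/m}$, with all constants dropped. The specific values (100 distinct words per document, $B^2 = 100$, a dictionary of 100,000 words, target rate 0.1) are assumptions chosen to match the rough figures quoted above, not results from the text.

```python
import math

# Illustrative numbers in the spirit of the example; all three are assumptions.
rho_sq = 100        # at most 100 distinct words in a short document
B_sq = 100          # squared norm budget of the halfspace
d = 100_000         # dictionary size


def svm_rate(m):
    """Norm-based estimation term sqrt(rho^2 B^2 / m), constants dropped."""
    return math.sqrt(rho_sq * B_sq / m)


def vc_rate(m):
    """Dimension-based estimation term sqrt(d / m), constants dropped."""
    return math.sqrt(d / m)


def samples_for(complexity, inv_eps):
    """Smallest m with sqrt(complexity / m) <= 1 / inv_eps."""
    return complexity * inv_eps ** 2


inv_eps = 10                                   # target rate 1 / 10 = 0.1
m_svm = samples_for(rho_sq * B_sq, inv_eps)    # driven by rho^2 B^2
m_vc = samples_for(d, inv_eps)                 # driven by d
```

With these numbers the dimension-based requirement is ten times larger than the norm-based one, which is the order-of-magnitude gap mentioned above; a larger dictionary widens the gap further.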

Of course, it is possible to construct problems in which the SVM bound will be 
worse than the VC bound. When we use SVM, we in fact introduce another form 
of inductive bias — we prefer large margin halfspaces. While this inductive bias can 
significantly decrease our estimation error, it can also enlarge the approximation 
error. 


15.2.3 The Ramp Loss* 


The margin-based bounds we have derived in Corollary 15.7 rely on the fact that 
we minimize the hinge loss. As we have shown in the previous subsection, the 
term $\sqrt{\rho^2 B^2 / m}$ can be much smaller than the corresponding term in the VC bound, 
$\sqrt{d/m}$. However, the approximation error in Corollary 15.7 is measured with respect 
to the hinge loss while the approximation error in VC bounds is measured with 
respect to the 0-1 loss. Since the hinge loss upper bounds the 0-1 loss, the 
approximation error with respect to the 0-1 loss will never exceed that of the hinge 
loss. 

It is not possible to derive bounds that involve the estimation error term 
$\sqrt{\rho^2 B^2 / m}$ for the 0-1 loss. This follows from the fact that the 0-1 loss is scale 
insensitive, and therefore there is no meaning to the norm of $\mathbf{w}$ or its margin when 
we measure error with the 0-1 loss. However, it is possible to define a loss function 
that on one hand is scale sensitive and thus enjoys the estimation error $\sqrt{\rho^2 B^2 / m}$ 
while on the other hand is more similar to the 0-1 loss. One option is the ramp 
loss, defined as 


$$\ell^{\mathrm{ramp}}(\mathbf{w}, (\mathbf{x}, y)) = \min\{1, \ell^{\mathrm{hinge}}(\mathbf{w}, (\mathbf{x}, y))\} = \min\{1, \max\{0, 1 - y \langle \mathbf{w}, \mathbf{x} \rangle\}\}.$$


The ramp loss penalizes mistakes in the same way as the 0-1 loss and does not 
penalize examples that are separated with margin. The difference between the ramp 
loss and the 0-1 loss is only with respect to examples that are correctly classified but 
not with a significant margin. Generalization bounds for the ramp loss are given in 
the advanced part of this book. 


[Figure: the hinge loss, the ramp loss, and the 0-1 loss as functions of $y \langle \mathbf{w}, \mathbf{x} \rangle$.] 


The reason SVM relies on the hinge loss and not on the ramp loss is that the 
hinge loss is convex and, therefore, from the computational point of view, min- 
imizing the hinge loss can be performed efficiently. In contrast, the problem of 
minimizing the ramp loss is computationally intractable. 
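
The relation between the three losses can be tabulated directly as a function of the quantity $y \langle \mathbf{w}, \mathbf{x} \rangle$. The sketch below is illustrative; it adopts the convention (an assumption, since the text does not fix it here) that a value of exactly zero counts as a mistake for the 0-1 loss.

```python
def zero_one_loss(margin):
    """1 for a mistake, 0 otherwise; margin == 0 is counted as a mistake."""
    return 1.0 if margin <= 0 else 0.0


def hinge_loss(margin):
    """max{0, 1 - margin}, convex in the margin."""
    return max(0.0, 1.0 - margin)


def ramp_loss(margin):
    """min{1, hinge}: the hinge loss clipped at 1."""
    return min(1.0, hinge_loss(margin))


# margin stands for y * <w, x>
margins = [-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]
table = [(m, zero_one_loss(m), ramp_loss(m), hinge_loss(m)) for m in margins]
for row in table:
    print("margin=%5.1f  0-1=%.1f  ramp=%.1f  hinge=%.1f" % row)
```

On every row the 0-1 loss is at most the ramp loss, which is at most the hinge loss. The ramp and 0-1 losses differ only for margins strictly between 0 and 1, while the hinge loss keeps growing for badly misclassified examples.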


15.3 OPTIMALITY CONDITIONS AND “SUPPORT VECTORS”* 


The name “Support Vector Machine” stems from the fact that the solution of 
hard-SVM, $\mathbf{w}_0$, is supported by (i.e., is in the linear span of) the examples that are exactly 
at distance $1/\|\mathbf{w}_0\|$ from the separating hyperplane. These vectors are therefore 
called support vectors. To see this, we rely on Fritz John optimality conditions. 


Theorem 15.8. Let $\mathbf{w}_0$ be as defined in Equation (15.3) and let $I = \{i : |\langle \mathbf{w}_0, \mathbf{x}_i \rangle| = 1\}$. 
Then, there exist coefficients $\alpha_1, \ldots, \alpha_m$ such that 

$$\mathbf{w}_0 = \sum_{i \in I} \alpha_i \mathbf{x}_i.$$

The examples $\{\mathbf{x}_i : i \in I\}$ are called support vectors. 

The proof of this theorem follows by applying the following lemma to 
Equation (15.3). 


Lemma 15.9 (Fritz John). Suppose that 

$$\mathbf{w}^* \in \operatorname*{argmin}_{\mathbf{w}} f(\mathbf{w}) \quad \text{s.t.} \quad \forall i \in [m], \; g_i(\mathbf{w}) \le 0,$$

where $f, g_1, \ldots, g_m$ are differentiable. Then, there exists $\boldsymbol{\alpha} \in \mathbb{R}^m$ such that 
$\nabla f(\mathbf{w}^*) + \sum_{i \in I} \alpha_i \nabla g_i(\mathbf{w}^*) = \mathbf{0}$, where $I = \{i : g_i(\mathbf{w}^*) = 0\}$. 


15.4 DUALITY* 


Historically, many of the properties of SVM have been obtained by considering 
the dual of Equation (15.3). Our presentation of SVM does not rely on duality. For 
completeness, we present in the following how to derive the dual of Equation (15.3). 

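We start by rewriting the problem in an equivalent form as follows. Consider the 
function 


$$g(\mathbf{w}) = \max_{\boldsymbol{\alpha} \in \mathbb{R}^m : \boldsymbol{\alpha} \ge \mathbf{0}} \sum_{i=1}^m \alpha_i (1 - y_i \langle \mathbf{w}, \mathbf{x}_i \rangle) = \begin{cases} 0 & \text{if } \forall i, \; y_i \langle \mathbf{w}, \mathbf{x}_i \rangle \ge 1 \\ \infty & \text{otherwise.} \end{cases}$$


We can therefore rewrite Equation (15.3) as 

$$\min_{\mathbf{w}} \left( \|\mathbf{w}\|^2 + g(\mathbf{w}) \right). \tag{15.7}$$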

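Rearranging the preceding (and scaling the norm term by 1/2, which does not change 
the minimizer) we obtain that Equation (15.3) can be rewritten as the problem 


$$\min_{\mathbf{w}} \max_{\boldsymbol{\alpha} \in \mathbb{R}^m : \boldsymbol{\alpha} \ge \mathbf{0}} \left( \frac{1}{2} \|\mathbf{w}\|^2 + \sum_{i=1}^m \alpha_i (1 - y_i \langle \mathbf{w}, \mathbf{x}_i \rangle) \right). \tag{15.8}$$


Recall that, on the basis of Equation (14.15), we can rewrite the update rule of 
SGD as 

$$\mathbf{w}^{(t+1)} = -\frac{1}{\lambda t} \sum_{j=1}^t \mathbf{v}_j,$$

where $\mathbf{v}_j$ is a subgradient of the loss function at $\mathbf{w}^{(j)}$ on the random example chosen 
at iteration $j$. For the hinge loss, given an example $(\mathbf{x}, y)$, we can choose $\mathbf{v}_j$ to be $\mathbf{0}$ if 
$y \langle \mathbf{w}^{(j)}, \mathbf{x} \rangle \ge 1$ and $\mathbf{v}_j = -y \mathbf{x}$ otherwise (see Example 14.2). Denoting 
$\boldsymbol{\theta}^{(t)} = -\sum_{j < t} \mathbf{v}_j$ we obtain the following procedure. 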


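SGD for Solving Soft-SVM 


goal: Solve Equation (15.12) 
parameter: $T$ 
initialize: $\boldsymbol{\theta}^{(1)} = \mathbf{0}$ 
for $t = 1, \ldots, T$ 
    Let $\mathbf{w}^{(t)} = \frac{1}{\lambda t} \boldsymbol{\theta}^{(t)}$ 
    Choose $i$ uniformly at random from $[m]$ 
    If ($y_i \langle \mathbf{w}^{(t)}, \mathbf{x}_i \rangle < 1$) 
        Set $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + y_i \mathbf{x}_i$ 
    Else 
        Set $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)}$ 
output: $\bar{\mathbf{w}} = \frac{1}{T} \sum_{t=1}^T \mathbf{w}^{(t)}$ 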
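
The procedure above translates almost line by line into code. The following is a minimal sketch rather than a tuned implementation. Equation (15.12) is not reproduced in this part of the text, so the sketch assumes it is the homogenous Soft-SVM objective $\frac{\lambda}{2} \|\mathbf{w}\|^2 + \frac{1}{m} \sum_i \max\{0, 1 - y_i \langle \mathbf{w}, \mathbf{x}_i \rangle\}$; the function name `sgd_soft_svm`, the `seed` argument, and the toy data are illustrative choices.

```python
import random


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def sgd_soft_svm(S, lam, T, seed=0):
    """Run T iterations of the SGD procedure and return the averaged vector."""
    rng = random.Random(seed)
    dim = len(S[0][0])
    theta = [0.0] * dim          # theta^(1) = 0
    w_sum = [0.0] * dim          # running sum of w^(1), ..., w^(T)
    for t in range(1, T + 1):
        w = [th / (lam * t) for th in theta]          # w^(t) = theta^(t) / (lam t)
        w_sum = [a + c for a, c in zip(w_sum, w)]
        x, y = S[rng.randrange(len(S))]               # i uniform in [m]
        if y * dot(w, x) < 1:
            # hinge loss is active: theta^(t+1) = theta^(t) + y_i x_i
            theta = [th + y * xi for th, xi in zip(theta, x)]
        # otherwise theta^(t+1) = theta^(t)
    return [a / T for a in w_sum]


# Toy linearly separable sample (assumed for illustration).
S = [([2.0, 1.0], 1), ([1.5, 2.0], 1), ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]
w_bar = sgd_soft_svm(S, lam=0.1, T=1000)
train_margins = [y * dot(w_bar, x) for x, y in S]
```

Only the vector `theta` and a running sum are stored, so each iteration costs time linear in the dimension. On this toy sample every entry of `train_margins` is positive, meaning the averaged vector classifies all training examples correctly.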


15.6 SUMMARY 


SVM is an algorithm for learning halfspaces with a certain type of prior knowledge, 
namely, preference for large margin. Hard-SVM seeks the halfspace that separates 
the data perfectly with the largest margin, whereas soft-SVM does not assume sep- 
arability of the data and allows the constraints to be violated to some extent. The 
sample complexity for both types of SVM is different from the sample complexity 
of straightforward halfspace learning, as it does not depend on the dimension of the 
domain but rather on parameters such as the maximal norms of x and w. 

The importance of dimension-independent sample complexity will be realized 
in the next chapter, where we will discuss the embedding of the given domain into 
some high dimensional feature space as means for enriching our hypothesis class. 
Such a procedure raises computational and sample complexity problems. The lat- 
ter is solved by using SVM, whereas the former can be solved by using SVM with 
kernels, as we will see in the next chapter. 


15.7 BIBLIOGRAPHIC REMARKS 


SVMs have been introduced in (Cortes and Vapnik 1995, Boser, Guyon and Vapnik 
1992). There are many good books on the theoretical and practical aspects of SVMs, 
for example (Vapnik 1995, Cristianini & Shawe-Taylor 2000, Schölkopf & Smola 
2002, Hsu et al. 2003, Steinwart and Christmann 2008). Using SGD for solving 
soft-SVM has been proposed in Shalev-Shwartz et al. (2007). 


15.8 EXERCISES 

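15.1 Show that the hard-SVM rule, namely, 


$$\operatorname*{argmax}_{(\mathbf{w}, b) : \|\mathbf{w}\| = 1} \; \min_{i \in [m]} |\langle \mathbf{w}, \mathbf{x}_i \rangle + b| \quad \text{s.t.} \quad \forall i, \; y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) > 0,$$


is equivalent to the following formulation: 


$$\operatorname*{argmax}_{(\mathbf{w}, b) : \|\mathbf{w}\| = 1} \; \min_{i \in [m]} y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b).$$

Hint: Define $G = \{(\mathbf{w}, b) : \forall i, \; y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) > 0\}$. 
1. Show that 


$$\operatorname*{argmax}_{(\mathbf{w}, b) : \|\mathbf{w}\| = 1} \; \min_{i \in [m]} y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \in G.$$


2. Show that $\forall (\mathbf{w}, b) \in G$, 


$$\min_{i \in [m]} y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) = \min_{i \in [m]} |\langle \mathbf{w}, \mathbf{x}_i \rangle + b|. \tag{15.13}$$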


15.2 Margin and the Perceptron: Consider a training set that is linearly separable with a 
margin $\gamma$ and such that all the instances are within a ball of radius $\rho$. Prove that the 
maximal number of updates the Batch Perceptron algorithm given in Section 9.1.2 
will make when running on this training set is $(\rho/\gamma)^2$. 

15.3 Hard versus soft SVM: Prove or refute the following claim: 


There exists $\lambda > 0$ such that for every sample $S$ of $m > 1$ examples, which is 
separable by the class of homogenous halfspaces, the hard-SVM and the soft-SVM (with 
parameter $\lambda$) learning rules return exactly the same weight vector. 


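15.4 Weak duality: Prove that for any function $f$ of two vector variables $\mathbf{x} \in \mathcal{X}$, $\mathbf{y} \in \mathcal{Y}$, it 
holds that 

$$\min_{\mathbf{x} \in \mathcal{X}} \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y}) \ge \max_{\mathbf{y} \in \mathcal{Y}} \min_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x}, \mathbf{y}).$$


In the previous chapter we described the SVM paradigm for learning halfspaces in 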
high dimensional feature spaces. This enables us to enrich the expressive power of 
halfspaces by first mapping the data into a high dimensional feature space, and then 
learning a linear predictor in that space. This is similar to the AdaBoost algorithm, 
which learns a composition of a halfspace over base hypotheses. While this approach 
greatly extends the expressiveness of halfspace predictors, it raises both sample 
complexity and computational complexity challenges. In the previous chapter we 
tackled the sample complexity issue using the concept of margin. In this chapter we 
tackle the computational complexity challenge using the method of kernels. 

We start the chapter by describing the idea of embedding the data into a high 
dimensional feature space. We then introduce the idea of kernels. A kernel is a 
type of a similarity measure between instances. The special property of kernel sim- 
ilarities is that they can be viewed as inner products in some Hilbert space (or 
Euclidean space of some high dimension) to which the instance space is virtually 
embedded. We introduce the “kernel trick” that enables computationally efficient 
implementation of learning, without explicitly handling the high dimensional rep- 
resentation of the domain instances. Kernel based learning algorithms, and in 
particular kernel-SVM, are very useful and popular machine learning tools. Their 
success may be attributed both to being flexible for accommodating domain spe- 
cific prior knowledge and to having a well developed set of efficient implementation 
algorithms. 


16.1 EMBEDDINGS INTO FEATURE SPACES 


The expressive power of halfspaces is rather restricted — for example, the following 
training set is not separable by a halfspace. 

Let the domain be the real line; consider the domain points $\{-10, -9, -8, \ldots, 0, 
1, \ldots, 9, 10\}$ where the labels are $+1$ for all $x$ such that $|x| > 2$ and $-1$ otherwise. 

To make the class of halfspaces more expressive, we can first map the original 
instance space into another space (possibly of a higher dimension) and then learn a 
halfspace in that space. For example, consider the example mentioned previously. 
Instead of learning a halfspace in the original representation let us first define a 
mapping $\psi : \mathbb{R} \to \mathbb{R}^2$ as follows: 


$$\psi(x) = (x, x^2).$$


We use the term feature space to denote the range of $\psi$. After applying $\psi$ the 
data can be easily explained using the halfspace $h(x) = \mathrm{sign}(\langle \mathbf{w}, \psi(x) \rangle - b)$, where 
$\mathbf{w} = (0, 1)$ and $b = 5$. 
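
This example is small enough to verify exhaustively. The sketch below rebuilds the 21 labeled points, checks that the halfspace with $\mathbf{w} = (0, 1)$ and $b = 5$ labels all of them correctly after the embedding, and confirms by brute force that no threshold rule on the real line does. The helper `threshold_rule` and the grid of half-integer thresholds are illustrative choices; half-integers suffice here because the points are integers.

```python
def psi(x):
    """The embedding x -> (x, x^2)."""
    return (x, x * x)


def h(x, w=(0.0, 1.0), b=5.0):
    """Halfspace in feature space: sign(<w, psi(x)> - b)."""
    z = psi(x)
    score = w[0] * z[0] + w[1] * z[1] - b
    return 1 if score > 0 else -1


points = list(range(-10, 11))
labels = [1 if abs(x) > 2 else -1 for x in points]

separated_in_feature_space = all(h(x) == y for x, y in zip(points, labels))


def threshold_rule(x, direction, c):
    """A halfspace on the real line: sign(direction * (x - c))."""
    return 1 if direction * (x - c) > 0 else -1


# Every distinct labeling a 1-D halfspace can induce on integer points
# is obtained with a half-integer threshold and one of two directions.
thresholds = [k + 0.5 for k in range(-11, 11)]
separated_on_the_line = any(
    all(threshold_rule(x, s, c) == y for x, y in zip(points, labels))
    for s in (1, -1)
    for c in thresholds
)
```

The first flag comes out true and the second false: the labels form a positive, negative, positive pattern along the line, which no single threshold can produce, while in feature space the second coordinate $x^2$ separates them.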

The basic paradigm is as follows: 


1. Given some domain set $\mathcal{X}$ and a learning task, choose a mapping $\psi : \mathcal{X} \to \mathcal{F}$, 
for some feature space $\mathcal{F}$, that will usually be $\mathbb{R}^n$ for some $n$ (however, the 
range of such a mapping can be any Hilbert space, including such spaces of 
infinite dimension, as we will show later). 

2. Given a sequence of labeled examples, $S = (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)$, create the 
image sequence $\hat{S} = (\psi(\mathbf{x}_1), y_1), \ldots, (\psi(\mathbf{x}_m), y_m)$. 

3. Train a linear predictor $h$ over $\hat{S}$. 

4. Predict the label of a test point, $\mathbf{x}$, to be $h(\psi(\mathbf{x}))$. 


Note that, for every probability distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, we can readily 
define its image probability distribution $\mathcal{D}^\psi$ over $\mathcal{F} \times \mathcal{Y}$ by setting, for every subset 
$A \subseteq \mathcal{F} \times \mathcal{Y}$, $\mathcal{D}^\psi(A) = \mathcal{D}(\psi^{-1}(A))$.¹ It follows that for every predictor $h$ over the 
feature space, $L_{\mathcal{D}^\psi}(h) = L_{\mathcal{D}}(h \circ \psi)$, where $h \circ \psi$ is the composition of $h$ onto $\psi$. 

The success of this learning paradigm depends on choosing a good $\psi$ for a given 
learning task: that is, a $\psi$ that will make the image of the data distribution (close to 
being) linearly separable in the feature space, thus making the resulting algorithm a 
good learner for a given task. Picking such an embedding requires prior knowledge 
about that task. However, often some generic mappings that enable us to enrich 
the class of halfspaces and extend its expressiveness are used. One notable example 
is polynomial mappings, which are a generalization of the $\psi$ we have seen in the 
previous example. 

Recall that the prediction of a standard halfspace classifier on an instance $\mathbf{x}$ is 
based on the linear mapping $\mathbf{x} \mapsto \langle \mathbf{w}, \mathbf{x} \rangle$. We can generalize linear mappings to a 
polynomial mapping, $\mathbf{x} \mapsto p(\mathbf{x})$, where $p$ is a multivariate polynomial of degree 
$k$. For simplicity, consider first the case in which $x$ is 1 dimensional. In that case, 
$p(x) = \sum_{j=0}^k w_j x^j$, where $\mathbf{w} \in \mathbb{R}^{k+1}$ is the vector of coefficients of the polynomial we 
need to learn. We can rewrite $p(x) = \langle \mathbf{w}, \psi(x) \rangle$ where $\psi : \mathbb{R} \to \mathbb{R}^{k+1}$ is the mapping 
$x \mapsto (1, x, x^2, x^3, \ldots, x^k)$. It follows that learning a $k$ degree polynomial over $\mathbb{R}$ can 
be done by learning a linear mapping in the $(k + 1)$ dimensional feature space. 
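
As a small illustration of the one dimensional case, the sketch below evaluates a polynomial both as an inner product with the feature vector $(1, x, \ldots, x^k)$ and by direct evaluation. The coefficient vector is an arbitrary example and the function names are hypothetical.

```python
def psi(x, k):
    """Feature map x -> (1, x, x^2, ..., x^k)."""
    return [x ** j for j in range(k + 1)]


def poly_via_features(w, x):
    """p(x) = <w, psi(x)>, a linear function of the features."""
    return sum(wj * fj for wj, fj in zip(w, psi(x, len(w) - 1)))


def poly_horner(w, x):
    """Direct evaluation of sum_j w[j] x^j, used only for comparison."""
    acc = 0.0
    for wj in reversed(w):
        acc = acc * x + wj
    return acc


w = [1.0, -2.0, 0.0, 3.0]     # p(x) = 1 - 2x + 3x^3, an arbitrary example
value = poly_via_features(w, 2.0)
```

The polynomial is nonlinear in $x$ but linear in the coefficients, so fitting it is a linear problem over the $k + 1$ features.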


More generally, a degree $k$ multivariate polynomial from $\mathbb{R}^n$ to $\mathbb{R}$ can be 
written as 

$$p(\mathbf{x}) = \sum_{J \in [n]^r : r \le k} w_J \prod_{i=1}^r x_{J_i}. \tag{16.1}$$


As before, we can rewrite $p(\mathbf{x}) = \langle \mathbf{w}, \psi(\mathbf{x}) \rangle$ where now $\psi : \mathbb{R}^n \to \mathbb{R}^d$ is such that 
for every $J \in [n]^r$, $r \le k$, the coordinate of $\psi(\mathbf{x})$ associated with $J$ is the monomial 
$\prod_{i=1}^r x_{J_i}$. 


¹ This is defined for every $A$ such that $\psi^{-1}(A)$ is measurable with respect to $\mathcal{D}$. 

Naturally, polynomial-based classifiers yield much richer hypothesis classes than 
halfspaces. We have seen at the beginning of this chapter an example in which the 
training set, in its original domain ($\mathcal{X} = \mathbb{R}$), cannot be separated by a halfspace, but 
after the embedding $x \mapsto (x, x^2)$ it is perfectly separable. So, while the classifier 
is always linear in the feature space, it can have highly nonlinear behavior on the 
original space from which instances were sampled. 

In general, we can choose any feature mapping $\psi$ that maps the original 
instances into some Hilbert space.² The Euclidean space $\mathbb{R}^d$ is a Hilbert space for 
any finite $d$. But there are also infinite dimensional Hilbert spaces (as we shall see 
later on in this chapter). 

The bottom line of this discussion is that we can enrich the class of halfspaces by 
first applying a nonlinear mapping, $\psi$, that maps the instance space into some 
feature space, and then learning a halfspace in that feature space. However, if the range 
of $\psi$ is a high dimensional space we face two problems. First, the VC-dimension of 
halfspaces in $\mathbb{R}^n$ is $n + 1$, and therefore, if the range of $\psi$ is very large, we need 
many more samples in order to learn a halfspace in the range of $\psi$. Second, from 
the computational point of view, performing calculations in the high dimensional 
space might be too costly. In fact, even the representation of the vector $\mathbf{w}$ in the 
feature space can be unrealistic. The first issue can be tackled using the paradigm 
of large margin (or low norm predictors), as we already discussed in the previous 
chapter in the context of the SVM algorithm. In the following section we address 
the computational issue. 


16.2 THE KERNEL TRICK 


We have seen that embedding the input space into some high dimensional feature 
space makes halfspace learning more expressive. However, the computational 
complexity of such learning may still pose a serious hurdle: computing linear separators 
over very high dimensional data may be computationally expensive. The common 
solution to this concern is kernel based learning. The term “kernels” is used in 
this context to describe inner products in the feature space. Given an embedding 
$\psi$ of some domain space $\mathcal{X}$ into some Hilbert space, we define the kernel 
function $K(\mathbf{x}, \mathbf{x}') = \langle \psi(\mathbf{x}), \psi(\mathbf{x}') \rangle$. One can think of $K$ as specifying similarity between 
instances and of the embedding $\psi$ as mapping the domain set $\mathcal{X}$ into a space where 
these similarities are realized as inner products. It turns out that many learning 
algorithms for halfspaces can be carried out just on the basis of the values of the kernel 
function over pairs of domain points. The main advantage of such algorithms is that 
they implement linear separators in high dimensional feature spaces without 
having to specify points in that space or expressing the embedding $\psi$ explicitly. The 
remainder of this section is devoted to constructing such algorithms. 


² A Hilbert space is a vector space with an inner product, which is also complete. A space is complete if 
all Cauchy sequences in the space converge. In our case, the norm $\|\mathbf{w}\|$ is defined by the inner product 
$\sqrt{\langle \mathbf{w}, \mathbf{w} \rangle}$. The reason we require the range of $\psi$ to be in a Hilbert space is that projections in a Hilbert 
space are well defined. In particular, if $M$ is a linear subspace of a Hilbert space, then every $\mathbf{x}$ in the 
Hilbert space can be written as a sum $\mathbf{x} = \mathbf{u} + \mathbf{v}$ where $\mathbf{u} \in M$ and $\langle \mathbf{v}, \mathbf{w} \rangle = 0$ for all $\mathbf{w} \in M$. We use this 
fact in the proof of the representer theorem given in the next section. 

In the previous chapter we saw that regularizing the norm of w yields a small 
sample complexity even if the dimensionality of the feature space is high. Interest- 
ingly, as we show later, regularizing the norm of w is also helpful in overcoming the 
computational problem. To do so, first note that all versions of the SVM optimiza- 
tion problem we have derived in the previous chapter are instances of the following 
general problem: 


$$\min_{\mathbf{w}} \left( f\left( \langle \mathbf{w}, \psi(\mathbf{x}_1) \rangle, \ldots, \langle \mathbf{w}, \psi(\mathbf{x}_m) \rangle \right) + R(\|\mathbf{w}\|) \right), \tag{16.2}$$


where $f : \mathbb{R}^m \to \mathbb{R}$ is an arbitrary function and $R : \mathbb{R}_+ \to \mathbb{R}$ is a monotonically 
nondecreasing function. For example, Soft-SVM for homogenous halfspaces 
(Equation (15.6)) can be derived from Equation (16.2) by letting $R(a) = \lambda a^2$ and 
$f(a_1, \ldots, a_m) = \frac{1}{m} \sum_i \max\{0, 1 - y_i a_i\}$. Similarly, Hard-SVM for nonhomogenous 
halfspaces (Equation (15.2)) can be derived from Equation (16.2) by letting 
$R(a) = a^2$ and letting $f(a_1, \ldots, a_m)$ be 0 if there exists $b$ such that $y_i(a_i + b) \ge 1$ 
for all $i$, and $f(a_1, \ldots, a_m) = \infty$ otherwise. 

The following theorem shows that there exists an optimal solution of 
Equation (16.2) that lies in the span of $\{\psi(\mathbf{x}_1), \ldots, \psi(\mathbf{x}_m)\}$. 


Theorem 16.1 (Representer Theorem). Assume that $\psi$ is a mapping from $\mathcal{X}$ to a 
Hilbert space. Then, there exists a vector $\boldsymbol{\alpha} \in \mathbb{R}^m$ such that $\mathbf{w} = \sum_{i=1}^m \alpha_i \psi(\mathbf{x}_i)$ is an 
optimal solution of Equation (16.2). 


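Proof. Let $\mathbf{w}^*$ be an optimal solution of Equation (16.2). Because $\mathbf{w}^*$ is an element 
of a Hilbert space, we can rewrite $\mathbf{w}^*$ as 

$$\mathbf{w}^* = \sum_{i=1}^m \alpha_i \psi(\mathbf{x}_i) + \mathbf{u},$$

where $\langle \mathbf{u}, \psi(\mathbf{x}_i) \rangle = 0$ for all $i$. Set $\mathbf{w} = \mathbf{w}^* - \mathbf{u}$. Clearly, $\|\mathbf{w}^*\|^2 = \|\mathbf{w}\|^2 + \|\mathbf{u}\|^2$, 
thus $\|\mathbf{w}\| \le \|\mathbf{w}^*\|$. Since $R$ is nondecreasing we obtain that $R(\|\mathbf{w}\|) \le R(\|\mathbf{w}^*\|)$. 
Additionally, for all $i$ we have that 

$$\langle \mathbf{w}, \psi(\mathbf{x}_i) \rangle = \langle \mathbf{w}^* - \mathbf{u}, \psi(\mathbf{x}_i) \rangle = \langle \mathbf{w}^*, \psi(\mathbf{x}_i) \rangle,$$

hence 

$$f\left( \langle \mathbf{w}, \psi(\mathbf{x}_1) \rangle, \ldots, \langle \mathbf{w}, \psi(\mathbf{x}_m) \rangle \right) = f\left( \langle \mathbf{w}^*, \psi(\mathbf{x}_1) \rangle, \ldots, \langle \mathbf{w}^*, \psi(\mathbf{x}_m) \rangle \right).$$

We have shown that the objective of Equation (16.2) at $\mathbf{w}$ cannot be larger than the 
objective at $\mathbf{w}^*$ and therefore $\mathbf{w}$ is also an optimal solution. Since $\mathbf{w} = \sum_{i=1}^m \alpha_i \psi(\mathbf{x}_i)$ 
we conclude our proof. □ 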


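On the basis of the representer theorem we can optimize Equation (16.2) 
with respect to the coefficients $\boldsymbol{\alpha}$ instead of the coefficients $\mathbf{w}$ as follows. Writing 
$\mathbf{w} = \sum_{j=1}^m \alpha_j \psi(\mathbf{x}_j)$ we have that for all $i$ 

$$\langle \mathbf{w}, \psi(\mathbf{x}_i) \rangle = \Big\langle \sum_j \alpha_j \psi(\mathbf{x}_j), \psi(\mathbf{x}_i) \Big\rangle = \sum_{j=1}^m \alpha_j \langle \psi(\mathbf{x}_j), \psi(\mathbf{x}_i) \rangle.$$

Similarly, 

$$\|\mathbf{w}\|^2 = \Big\langle \sum_j \alpha_j \psi(\mathbf{x}_j), \sum_j \alpha_j \psi(\mathbf{x}_j) \Big\rangle = \sum_{i,j=1}^m \alpha_i \alpha_j \langle \psi(\mathbf{x}_i), \psi(\mathbf{x}_j) \rangle.$$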


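Let $K(\mathbf{x}, \mathbf{x}') = \langle \psi(\mathbf{x}), \psi(\mathbf{x}') \rangle$ be a function that implements the kernel function with 
respect to the embedding $\psi$. Instead of solving Equation (16.2) we can solve the 
equivalent problem 

$$\min_{\boldsymbol{\alpha} \in \mathbb{R}^m} f\left( \sum_{j=1}^m \alpha_j K(\mathbf{x}_j, \mathbf{x}_1), \ldots, \sum_{j=1}^m \alpha_j K(\mathbf{x}_j, \mathbf{x}_m) \right) + R\left( \sqrt{\sum_{i,j=1}^m \alpha_i \alpha_j K(\mathbf{x}_j, \mathbf{x}_i)} \right). \tag{16.3}$$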


To solve the optimization problem given in Equation (16.3), we do not need any 
direct access to elements in the feature space. The only thing we should know is 
how to calculate inner products in the feature space, or equivalently, to calculate 
the kernel function. In fact, to solve Equation (16.3) we solely need to know the 
value of the $m \times m$ matrix $G$ s.t. $G_{i,j} = K(\mathbf{x}_i, \mathbf{x}_j)$, which is often called the Gram 
matrix. 

In particular, specifying the preceding to the Soft-SVM problem given in 
Equation (15.6), we can rewrite the problem as 

$$\min_{\boldsymbol{\alpha} \in \mathbb{R}^m} \left( \lambda \boldsymbol{\alpha}^T G \boldsymbol{\alpha} + \frac{1}{m} \sum_{i=1}^m \max\{0, 1 - y_i (G \boldsymbol{\alpha})_i\} \right), \tag{16.4}$$

where $(G \boldsymbol{\alpha})_i$ is the $i$'th element of the vector obtained by multiplying the Gram 
matrix $G$ by the vector $\boldsymbol{\alpha}$. Note that Equation (16.4) can be written as quadratic 
programming and hence can be solved efficiently. In the next section we describe an 
even simpler algorithm for solving Soft-SVM with kernels. 

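Once we learn the coefficients $\boldsymbol{\alpha}$ we can calculate the prediction on a new 
instance by 

$$\langle \mathbf{w}, \psi(\mathbf{x}) \rangle = \sum_{j=1}^m \alpha_j \langle \psi(\mathbf{x}_j), \psi(\mathbf{x}) \rangle = \sum_{j=1}^m \alpha_j K(\mathbf{x}_j, \mathbf{x}).$$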
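
The equivalence between the feature-space and kernel formulations can be checked on a case where the embedding is small enough to write out. The sketch below assumes the embedding $\psi(x) = (x, x^2)$ of scalars, whose kernel is $K(x, x') = x x' + x^2 (x')^2$; the coefficient vector is arbitrary (not a trained solution), and the function names are illustrative.

```python
def psi(x):
    """Explicit embedding of a scalar, used only to check the kernel computation."""
    return (x, x * x)


def K(x, xp):
    """Kernel of psi: <psi(x), psi(x')> = x x' + x^2 x'^2."""
    return x * xp + (x * x) * (xp * xp)


def gram_matrix(xs):
    """G[i][j] = K(x_i, x_j)."""
    return [[K(a, b) for b in xs] for a in xs]


def kernel_objective(alpha, G, ys, lam):
    """Objective of Equation (16.4), written only in terms of alpha and G."""
    m = len(ys)
    Ga = [sum(G[i][j] * alpha[j] for j in range(m)) for i in range(m)]
    reg = lam * sum(alpha[i] * Ga[i] for i in range(m))
    loss = sum(max(0.0, 1.0 - ys[i] * Ga[i]) for i in range(m)) / m
    return reg + loss


def explicit_objective(alpha, xs, ys, lam):
    """Same objective computed from w = sum_j alpha_j psi(x_j) in feature space."""
    w = [sum(a * psi(x)[c] for a, x in zip(alpha, xs)) for c in range(2)]
    scores = [w[0] * psi(x)[0] + w[1] * psi(x)[1] for x in xs]
    loss = sum(max(0.0, 1.0 - y * s) for y, s in zip(ys, scores)) / len(ys)
    return lam * (w[0] ** 2 + w[1] ** 2) + loss


def predict(alpha, xs, x):
    """<w, psi(x)> = sum_j alpha_j K(x_j, x)."""
    return sum(a * K(xj, x) for a, xj in zip(alpha, xs))


xs = [-3.0, -1.0, 0.5, 2.0, 4.0]
ys = [1, -1, -1, -1, 1]
alpha = [0.1, -0.2, 0.0, -0.1, 0.05]   # arbitrary coefficients for illustration
lam = 0.1
G = gram_matrix(xs)
```

The two objective functions return the same value, and `predict` agrees with the explicit inner product $\langle \mathbf{w}, \psi(x) \rangle$, even though the kernel versions never form $\mathbf{w}$ or $\psi(x)$.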


The advantage of working with kernels rather than directly optimizing w in 
the feature space is that in some situations the dimension of the feature space 
is extremely large while implementing the kernel function is very simple. A few 
examples are given in the following. 


Example 16.1 (Polynomial Kernels). The $k$ degree polynomial kernel is defined
to be

\[
K(x, x') = (1 + \langle x, x' \rangle)^k .
\]


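Now we will show that this is indeed a kernel function. That is, we will show that
there exists a mapping $\psi$ from the original space to some higher dimensional space
for which $K(x, x') = \langle \psi(x), \psi(x') \rangle$. For simplicity, denote $x_0 = x'_0 = 1$. Then, we have

\[
\begin{aligned}
K(x, x') &= (1 + \langle x, x' \rangle)^k = (1 + \langle x, x' \rangle) \cdots (1 + \langle x, x' \rangle) \\
&= \Big( \sum_{j=0}^n x_j x'_j \Big) \cdots \Big( \sum_{j=0}^n x_j x'_j \Big) \\
&= \sum_{J \in \{0,1,\ldots,n\}^k} \prod_{i=1}^k x_{J_i} x'_{J_i} \\
&= \sum_{J \in \{0,1,\ldots,n\}^k} \prod_{i=1}^k x_{J_i} \prod_{i=1}^k x'_{J_i} .
\end{aligned}
\]

Now, if we define $\psi : \mathbb{R}^n \to \mathbb{R}^{(n+1)^k}$ such that for $J \in \{0, 1, \ldots, n\}^k$ there is an element
of $\psi(x)$ that equals $\prod_{i=1}^k x_{J_i}$, we obtain that

\[
K(x, x') = \langle \psi(x), \psi(x') \rangle .
\]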


Since $\psi$ contains all the monomials up to degree $k$, a halfspace over the range
of $\psi$ corresponds to a polynomial predictor of degree $k$ over the original space.
Hence, learning a halfspace with a $k$ degree polynomial kernel enables us to learn
polynomial predictors of degree $k$ over the original space.

Note that here the complexity of implementing $K$ is $O(n)$ while the dimension
of the feature space is on the order of $n^k$.
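To make the gap between the two costs concrete, here is a small sketch (the function names `poly_kernel` and `poly_features` are ours, not from the text) that builds the explicit $(n+1)^k$-dimensional embedding from the derivation above and compares its inner product with the $O(n)$ kernel evaluation.

```python
from itertools import product
from math import prod


def poly_kernel(x, z, k):
    """K(x, z) = (1 + <x, z>)^k, computed in O(n) time."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** k


def poly_features(x, k):
    """Explicit psi(x): one coordinate prod_i x_{J_i} for every J in {0..n}^k,
    using the convention x_0 = 1."""
    ext = (1.0,) + tuple(x)
    return [prod(ext[j] for j in J)
            for J in product(range(len(ext)), repeat=k)]


x, z, k = (0.5, -1.0, 2.0), (1.5, 0.25, -0.5), 3
phi_x, phi_z = poly_features(x, k), poly_features(z, k)

explicit = sum(a * b for a, b in zip(phi_x, phi_z))  # <psi(x), psi(z)>
implicit = poly_kernel(x, z, k)                      # same value, no psi needed
```

For $n = 3$ and $k = 3$ the explicit vector already has $4^3 = 64$ coordinates, while the kernel needs a single inner product in $\mathbb{R}^3$.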


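Example 16.2 (Gaussian Kernel). Let the original instance space be $\mathbb{R}$ and consider
the mapping $\psi$ where for each nonnegative integer $n \ge 0$ there exists an
element $\psi(x)_n$ that equals $\frac{1}{\sqrt{n!}} e^{-\frac{x^2}{2}} x^n$. Then,

\[
\begin{aligned}
\langle \psi(x), \psi(x') \rangle &= \sum_{n=0}^{\infty} \left( \frac{1}{\sqrt{n!}} e^{-\frac{x^2}{2}} x^n \right) \left( \frac{1}{\sqrt{n!}} e^{-\frac{(x')^2}{2}} (x')^n \right) \\
&= e^{-\frac{x^2 + (x')^2}{2}} \sum_{n=0}^{\infty} \frac{(x x')^n}{n!} \\
&= e^{-\frac{\|x - x'\|^2}{2}} .
\end{aligned}
\]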


Here the feature space is of infinite dimension while evaluating the kernel is very
simple. More generally, given a scalar $\sigma > 0$, the Gaussian kernel is defined to be

\[
K(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma}} .
\]
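The series identity in Example 16.2 can be checked by truncating the infinite embedding. This is only a sketch for scalar inputs: `gaussian_features` and `gaussian_kernel` are hypothetical helper names, the truncation length of 30 terms is an arbitrary choice, and the exponent divides by $2\sigma$ to match the definition as written above.

```python
from math import exp, factorial, sqrt


def gaussian_features(x, n_terms):
    """First n_terms coordinates of psi(x):
    psi(x)_n = exp(-x^2 / 2) * x^n / sqrt(n!)."""
    return [exp(-x * x / 2.0) * x ** n / sqrt(factorial(n))
            for n in range(n_terms)]


def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-(x - z)^2 / (2 * sigma)) for scalar x and z."""
    return exp(-((x - z) ** 2) / (2.0 * sigma))


x, z = 0.7, -0.4
truncated = sum(a * b for a, b in zip(gaussian_features(x, 30),
                                      gaussian_features(z, 30)))
exact = gaussian_kernel(x, z)  # sigma = 1 corresponds to Example 16.2
```

Because the terms decay like $(x x')^n / n!$, a few dozen coordinates already reproduce the kernel value to machine precision for moderate inputs.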


Intuitively, the Gaussian kernel sets the inner product in the feature space
between $x, x'$ to be close to zero if the instances are far away from each other (in
the original domain) and close to 1 if they are close. $\sigma$ is a parameter that controls
the scale determining what we mean by "close." It is easy to verify that $K$ imple-
ments an inner product in a space in which for any $n$ and any monomial of order $k$

represent an inner product between $\psi(x)$ and $\psi(x')$ for some feature mapping $\psi$?
The following lemma gives a sufficient and necessary condition.


Lemma 16.2. A symmetric function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ implements an inner product in
some Hilbert space if and only if it is positive semidefinite; namely, for all $x_1, \ldots, x_m$,
the Gram matrix, $G_{i,j} = K(x_i, x_j)$, is a positive semidefinite matrix.


Proof. It is trivial to see that if $K$ implements an inner product in some Hilbert
space then the Gram matrix is positive semidefinite. For the other direction, define
the space of functions over $\mathcal{X}$ as $\mathbb{R}^{\mathcal{X}} = \{ f : \mathcal{X} \to \mathbb{R} \}$. For each $x \in \mathcal{X}$ let $\psi(x)$ be
the function $x \mapsto K(\cdot, x)$. Define a vector space by taking all linear combinations of
elements of the form $K(\cdot, x)$. Define an inner product on this vector space to be

\[
\Big\langle \sum_i \alpha_i K(\cdot, x_i),\, \sum_j \beta_j K(\cdot, x'_j) \Big\rangle = \sum_{i,j} \alpha_i \beta_j K(x_i, x'_j).
\]

This is a valid inner product since it is symmetric (because $K$ is symmetric), it is
linear (immediate), and it is positive definite (it is easy to see that $K(x, x) \ge 0$ with
equality only for $\psi(x)$ being the zero function). Clearly,

\[
\langle \psi(x), \psi(x') \rangle = \langle K(\cdot, x), K(\cdot, x') \rangle = K(x, x'),
\]

which concludes our proof. $\square$
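The condition in Lemma 16.2 can be spot-checked numerically. The sketch below is an illustration under our own assumptions, not a proof: it evaluates the quadratic form $\alpha^\top G \alpha$ for random $\alpha$ with a Gaussian kernel, and exhibits one symmetric function, $-(x - x')^2$, whose Gram matrix fails the test. All names are hypothetical.

```python
import random
from math import exp


def gram_matrix(kernel, xs):
    return [[kernel(a, b) for b in xs] for a in xs]


def quadratic_form(G, alpha):
    """Compute alpha^T G alpha."""
    m = len(alpha)
    return sum(alpha[i] * G[i][j] * alpha[j]
               for i in range(m) for j in range(m))


def gaussian(x, z):
    return exp(-((x - z) ** 2) / 2.0)


def neg_sq_dist(x, z):
    # Symmetric, but NOT positive semidefinite, hence not a kernel.
    return -((x - z) ** 2)


rng = random.Random(0)
points = [rng.uniform(-3, 3) for _ in range(6)]
G_gauss = gram_matrix(gaussian, points)
gauss_values = [quadratic_form(G_gauss, [rng.gauss(0, 1) for _ in points])
                for _ in range(200)]

# A single witness alpha with a negative quadratic form rules out PSD.
G_bad = gram_matrix(neg_sq_dist, [0.0, 1.0])
bad_value = quadratic_form(G_bad, [1.0, 1.0])
```

Random sampling can only refute positive semidefiniteness, never certify it; for the Gaussian kernel the certificate is the explicit embedding of Example 16.2.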


16.3 IMPLEMENTING SOFT-SVM WITH KERNELS 


Next, we turn to solving Soft-SVM with kernels. While we could have designed
an algorithm for solving Equation (16.4), there is an even simpler approach that
directly tackles the Soft-SVM optimization problem in the feature space,

\[
\min_{w} \left( \frac{\lambda}{2} \|w\|^2 + \frac{1}{m} \sum_{i=1}^m \max\{0,\, 1 - y_i \langle w, \psi(x_i) \rangle\} \right), \tag{16.5}
\]

while only using kernel evaluations. The basic observation is that the vector $w^{(t)}$
maintained by the SGD procedure we have described in Section 15.5 is always in
the linear span of $\{\psi(x_1), \ldots, \psi(x_m)\}$. Therefore, rather than maintaining $w^{(t)}$ we
can maintain the corresponding coefficients $\alpha$.

Formally, let $K$ be the kernel function, namely, for all $x, x'$, $K(x, x') =
\langle \psi(x), \psi(x') \rangle$. We shall maintain two vectors in $\mathbb{R}^m$, corresponding to two vectors
$\theta^{(t)}$ and $w^{(t)}$ defined in the SGD procedure of Section 15.5. That is, $\beta^{(t)}$ will be a
vector such that

\[
\theta^{(t)} = \sum_{j=1}^m \beta_j^{(t)} \psi(x_j) \tag{16.6}
\]

and $\alpha^{(t)}$ be such that

\[
w^{(t)} = \sum_{j=1}^m \alpha_j^{(t)} \psi(x_j). \tag{16.7}
\]

The vectors $\beta$ and $\alpha$ are updated according to the following procedure.


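SGD for Solving Soft-SVM with Kernels


Goal: Solve Equation (16.5)
parameter: $T$
Initialize: $\beta^{(1)} = 0$
for $t = 1, \ldots, T$
    Let $\alpha^{(t)} = \frac{1}{\lambda t} \beta^{(t)}$
    Choose $i$ uniformly at random from $[m]$
    For all $j \neq i$ set $\beta_j^{(t+1)} = \beta_j^{(t)}$
    If ($y_i \sum_{j=1}^m \alpha_j^{(t)} K(x_j, x_i) < 1$)
        Set $\beta_i^{(t+1)} = \beta_i^{(t)} + y_i$
    Else
        Set $\beta_i^{(t+1)} = \beta_i^{(t)}$
Output: $\bar{w} = \sum_{j=1}^m \bar{\alpha}_j \psi(x_j)$ where $\bar{\alpha} = \frac{1}{T} \sum_{t=1}^T \alpha^{(t)}$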
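The procedure translates almost line by line into code. The following is a minimal sketch under our own assumptions: the function names, the `seed` argument, precomputing the full Gram matrix, and breaking a zero score toward $+1$ in `predict` are implementation choices that the text does not prescribe.

```python
import random


def kernel_sgd_soft_svm(xs, ys, kernel, lam, T, seed=0):
    """Return the averaged coefficient vector alpha_bar after T SGD steps."""
    rng = random.Random(seed)
    m = len(xs)
    G = [[kernel(a, b) for b in xs] for a in xs]  # only kernel values are used
    beta = [0.0] * m
    alpha_sum = [0.0] * m
    for t in range(1, T + 1):
        alpha = [b / (lam * t) for b in beta]     # alpha^(t) = beta^(t) / (lam t)
        alpha_sum = [s + a for s, a in zip(alpha_sum, alpha)]
        i = rng.randrange(m)                      # i uniform over the m examples
        margin = ys[i] * sum(alpha[j] * G[j][i] for j in range(m))
        if margin < 1:
            beta[i] += ys[i]                      # only coordinate i changes
    return [s / T for s in alpha_sum]


def predict(alpha_bar, xs, kernel, x):
    """Sign of <w_bar, psi(x)> = sum_j alpha_bar_j K(x_j, x)."""
    score = sum(a * kernel(xj, x) for a, xj in zip(alpha_bar, xs))
    return 1 if score >= 0 else -1


def linear(a, b):
    return sum(u * v for u, v in zip(a, b))


# Toy separable data, used only to exercise the code.
xs = [(2.0, 2.0), (3.0, 1.0), (-2.0, -2.0), (-1.0, -3.0)]
ys = [1, 1, -1, -1]
alpha_bar = kernel_sgd_soft_svm(xs, ys, linear, lam=0.1, T=200)
train_preds = [predict(alpha_bar, xs, linear, x) for x in xs]
```

Swapping `linear` for any other kernel function changes the hypothesis class without touching the training loop, which is the point of the kernel trick.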


The following lemma shows that the preceding implementation is equivalent to 
running the SGD procedure described in Section 15.5 on the feature space. 


Lemma 16.3. Let $\hat{w}$ be the output of the SGD procedure described in Section 15.5,
when applied on the feature space, and let $\bar{w} = \sum_{j=1}^m \bar{\alpha}_j \psi(x_j)$ be the output of
applying SGD with kernels. Then $\bar{w} = \hat{w}$.


Proof. We will show that for every $t$ Equation (16.6) holds, where $\theta^{(t)}$ is the result
of running the SGD procedure described in Section 15.5 in the feature space. By the
definition of $\alpha^{(t)} = \frac{1}{\lambda t} \beta^{(t)}$ and $w^{(t)} = \frac{1}{\lambda t} \theta^{(t)}$, this claim implies that Equation (16.7)
also holds, and the proof of our lemma will follow. To prove that Equation (16.6)
holds we use a simple inductive argument. For $t = 1$ the claim trivially holds. Assume
it holds for $t \ge 1$. Then,

\[
y_i \langle w^{(t)}, \psi(x_i) \rangle = y_i \Big\langle \sum_{j} \alpha_j^{(t)} \psi(x_j),\, \psi(x_i) \Big\rangle = y_i \sum_{j=1}^m \alpha_j^{(t)} K(x_j, x_i).
\]

Hence, the condition in the two algorithms is equivalent and if we update $\theta$ we have

\[
\theta^{(t+1)} = \theta^{(t)} + y_i \psi(x_i) = \sum_{j=1}^m \beta_j^{(t)} \psi(x_j) + y_i \psi(x_i) = \sum_{j=1}^m \beta_j^{(t+1)} \psi(x_j),
\]

which concludes our proof. $\square$


16.4 SUMMARY 


Mappings from the given domain to some higher dimensional space, on which a 
halfspace predictor is used, can be highly powerful. We benefit from a rich and 
complex hypothesis class, yet need to solve the problems of high sample and compu- 
tational complexities. In Chapter 10, we discussed the AdaBoost algorithm, which 
faces these challenges by using a weak learner: Even though we’re in a very high 
dimensional space, we have an “oracle” that bestows on us a single good coordinate 
to work with on each iteration. In this chapter we introduced a different approach, 
the kernel trick. The idea is that in order to find a halfspace predictor in the high
dimensional space, we do not need to know the representation of instances in that 
space, but rather the values of inner products between the mapped instances. Cal- 
culating inner products between instances in the high dimensional space without 
using their representation in that space is done using kernel functions. We have also 
shown how the SGD algorithm can be implemented using kernels. 

The ideas of feature mapping and the kernel trick allow us to use the framework 
of halfspaces and linear predictors for nonvectorial data. We demonstrated how 
kernels can be used to learn predictors over the domain of strings. 

We presented the applicability of the kernel trick in SVM. However, the ker- 
nel trick can be applied in many other algorithms. A few examples are given as 
exercises. 

This chapter ends the series of chapters on linear predictors and convex prob- 
lems. The next two chapters deal with completely different types of hypothesis 
classes. 


16.5 BIBLIOGRAPHIC REMARKS 


In the context of SVM, the kernel trick was introduced in Boser et al. (1992).
See also Aizerman et al. (1964). The observation that the kernel trick can be applied
whenever an algorithm only relies on inner products was first stated by Schölkopf
et al. (1998). The proof of the representer theorem is given in (Schölkopf et al.
2000, Schölkopf et al. 2001). The conditions stated in Lemma 16.2 are a simplification
of conditions due to Mercer. Many useful kernel functions have been introduced in
the literature for various applications. We refer the reader to Schölkopf & Smola
(2002).


16.6 EXERCISES 


16.1 Consider the task of finding a sequence of characters in a file, as described in
Section 16.2.1. Show that every member of the class $\mathcal{H}$ can be realized by composing
a linear classifier over $\psi(x)$, whose norm is 1 and that attains a margin of 1.

16.2 Kernelized Perceptron: Show how to run the Perceptron algorithm while only 
accessing the instances via the kernel function. Hint: The derivation is similar to 
the derivation of implementing SGD with kernels. 

16.3 Kernel Ridge Regression: The ridge regression problem, with a feature mapping
$\psi$, is the problem of finding a vector $w$ that minimizes the function

\[
f(w) = \lambda \|w\|^2 + \frac{1}{2m} \sum_{i=1}^m (\langle w, \psi(x_i) \rangle - y_i)^2, \tag{16.8}
\]

and then returning the predictor

\[
h(x) = \langle w, x \rangle .
\]

Show how to implement the ridge regression algorithm with kernels.

Hint: The representer theorem tells us that there exists a vector $\alpha \in \mathbb{R}^m$ such that
$\sum_{i=1}^m \alpha_i \psi(x_i)$ is a minimizer of Equation (16.8).


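1. Let $G$ be the Gram matrix with regard to $S$ and $K$. That is, $G_{ij} = K(x_i, x_j)$.
Define $g : \mathbb{R}^m \to \mathbb{R}$ by

\[
g(\alpha) = \lambda \cdot \alpha^\top G \alpha + \frac{1}{2m} \sum_{i=1}^m (\langle \alpha, G_{\cdot,i} \rangle - y_i)^2, \tag{16.9}
\]

where $G_{\cdot,i}$ is the $i$'th column of $G$. Show that if $\alpha^*$ minimizes Equation (16.9)
then $w^* = \sum_{i=1}^m \alpha_i^* \psi(x_i)$ is a minimizer of $f$.

2. Find a closed form expression for $\alpha^*$.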

16.4 Let $N$ be any positive integer. For every $x, x' \in \{1, \ldots, N\}$ define

\[
K(x, x') = \min\{x, x'\}.
\]

Prove that $K$ is a valid kernel; namely, find a mapping $\psi : \{1, \ldots, N\} \to H$ where $H$
is some Hilbert space, such that

\[
\forall x, x' \in \{1, \ldots, N\}, \quad K(x, x') = \langle \psi(x), \psi(x') \rangle .
\]


16.5 A supermarket manager would like to learn which of his customers have babies on
the basis of their shopping carts. Specifically, he sampled i.i.d. customers, where
for customer $i$, let $x_i \subset \{1, \ldots, d\}$ denote the subset of items the customer bought,
and let $y_i \in \{\pm 1\}$ be the label indicating whether this customer has a baby. As prior
knowledge, the manager knows that there are $k$ items such that the label is deter-
mined to be 1 iff the customer bought at least one of these $k$ items. Of course, the
identity of these $k$ items is not known (otherwise, there was nothing to learn). In
addition, according to the store regulation, each customer can buy at most $s$ items.
Help the manager to design a learning algorithm such that both its time complexity
and its sample complexity are polynomial in $s$, $k$, and $1/\epsilon$.

16.6 Let $\mathcal{X}$ be an instance set and let $\psi$ be a feature mapping of $\mathcal{X}$ into some Hilbert
feature space $V$. Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a kernel function that implements inner
products in the feature space $V$.

Consider the binary classification algorithm that predicts the label of an unseen
instance according to the class with the closest average. Formally, given a training
sequence $S = (x_1, y_1), \ldots, (x_m, y_m)$, for every $y \in \{\pm 1\}$ we define

\[
c_y = \frac{1}{m_y} \sum_{i : y_i = y} \psi(x_i),
\]

where $m_y = |\{i : y_i = y\}|$. We assume that $m_+$ and $m_-$ are nonzero. Then, the
algorithm outputs the following decision rule:

\[
h(x) = \begin{cases} 1 & \|\psi(x) - c_+\| \le \|\psi(x) - c_-\| \\ 0 & \text{otherwise.} \end{cases}
\]

1. Let $w = c_+ - c_-$ and let $b = \frac{1}{2}(\|c_-\|^2 - \|c_+\|^2)$. Show that
\[
h(x) = \operatorname{sign}(\langle w, \psi(x) \rangle + b).
\]

2. Show how to express $h(x)$ on the basis of the kernel function, and without
accessing individual entries of $\psi(x)$ or $w$.
Multiclass, Ranking, and Complex
Prediction Problems


Multiclass categorization is the problem of classifying instances into one of several
possible target classes. That is, we are aiming at learning a predictor $h : \mathcal{X} \to \mathcal{Y}$,
where $\mathcal{Y}$ is a finite set of categories. Applications include, for example, categorizing
documents according to topic ($\mathcal{X}$ is the set of documents and $\mathcal{Y}$ is the set of possible
topics) or determining which object appears in a given image ($\mathcal{X}$ is the set of images
and $\mathcal{Y}$ is the set of possible objects).

The centrality of the multiclass learning problem has spurred the development of 
various approaches for tackling the task. Perhaps the most straightforward approach 
is a reduction from multiclass classification to binary classification. In Section 17.1
we discuss the two most common reductions as well as the main drawback of the
reduction approach.

We then turn to describe a family of linear predictors for multiclass problems. 
Relying on the RLM and SGD frameworks from previous chapters, we describe 
several practical algorithms for multiclass prediction. 

In Section 17.3 we show how to use the multiclass machinery for complex pre-
diction problems in which $\mathcal{Y}$ can be extremely large but has some structure on it.
This task is often called structured output learning. In particular, we demonstrate
this approach for the task of recognizing handwritten words, in which $\mathcal{Y}$ is the set of
all possible strings of some bounded length (hence, the size of $\mathcal{Y}$ is exponential in
the maximal length of a word).

Finally, in Section 17.4 and Section 17.5 we discuss ranking problems in which 
the learner should order a set of instances according to their “relevance.” A typical 
application is ordering results of a search engine according to their relevance to the 
query. We describe several performance measures that are adequate for assessing 
the performance of ranking predictors and describe how to learn linear predictors 
for ranking problems efficiently. 


17.1 ONE-VERSUS-ALL AND ALL-PAIRS 


The simplest approach to tackle multiclass prediction problems is by reduction
to binary classification. Recall that in multiclass prediction we would like
to learn a function $h : \mathcal{X} \to \mathcal{Y}$. Without loss of generality let us denote
$\mathcal{Y} = \{1, \ldots, k\}$.

In the One-versus-All method (a.k.a. One-versus-Rest) we train $k$ binary classi-
fiers, each of which discriminates between one class and the rest of the classes. That
is, given a training set $S = (x_1, y_1), \ldots, (x_m, y_m)$, where every $y_i$ is in $\mathcal{Y}$, we construct
$k$ binary training sets, $S_1, \ldots, S_k$, where $S_i = (x_1, (-1)^{\mathbb{1}[y_1 \neq i]}), \ldots, (x_m, (-1)^{\mathbb{1}[y_m \neq i]})$. In
words, $S_i$ is the set of instances labeled 1 if their label in $S$ was $i$, and $-1$ otherwise.
For every $i \in [k]$ we train a binary predictor $h_i : \mathcal{X} \to \{\pm 1\}$ based on $S_i$, hoping that
$h_i(x)$ should equal 1 if and only if $x$ belongs to class $i$. Then, given $h_1, \ldots, h_k$, we
construct a multiclass predictor using the rule

\[
h(x) \in \operatorname*{argmax}_{i \in [k]} h_i(x). \tag{17.1}
\]


When more than one binary hypothesis predicts "1" we should somehow decide
which class to predict (e.g., we can arbitrarily decide to break ties by taking the
minimal index in $\operatorname{argmax}_i h_i(x)$). A better approach can be applied whenever each
$h_i$ hides additional information, which can be interpreted as the confidence in the
prediction $y = i$. For example, this is the case in halfspaces, where the actual predic-
tion is $\operatorname{sign}(\langle w, x \rangle)$, but we can interpret $\langle w, x \rangle$ as the confidence in the prediction.
In such cases, we can apply the multiclass rule given in Equation (17.1) on the real
valued predictions. A pseudocode of the One-versus-All approach is given in the
following.


One-versus-All


input:
    training set $S = (x_1, y_1), \ldots, (x_m, y_m)$
    algorithm for binary classification $A$
foreach $i \in \mathcal{Y}$
    let $S_i = (x_1, (-1)^{\mathbb{1}[y_1 \neq i]}), \ldots, (x_m, (-1)^{\mathbb{1}[y_m \neq i]})$
    let $h_i = A(S_i)$
output:
    the multiclass hypothesis defined by $h(x) \in \operatorname{argmax}_{i \in \mathcal{Y}} h_i(x)$
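A short sketch of the reduction follows. The binary learner `centroid_learner` is a hypothetical stand-in for the abstract algorithm $A$ (it scores a point by which class mean is closer); any learner that returns a real-valued scoring function could be passed instead. Ties are broken toward the smallest label, one of the options mentioned above.

```python
def centroid_learner(S_bin):
    """Illustrative binary learner A: returns a real-valued score h(x) that is
    positive when x is closer to the mean of the +1 examples."""
    pos = [x for x, y in S_bin if y == 1]
    neg = [x for x, y in S_bin if y == -1]

    def mean(pts):
        return [sum(c) / len(pts) for c in zip(*pts)]

    c_pos, c_neg = mean(pos), mean(neg)
    w = [p - n for p, n in zip(c_pos, c_neg)]
    b = 0.5 * (sum(n * n for n in c_neg) - sum(p * p for p in c_pos))
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b


def one_versus_all(S, labels, A):
    """S is a list of (x, y) pairs; A maps a list of (x, +1/-1) pairs to a
    scoring function. Returns the multiclass predictor h."""
    h = {}
    for i in labels:
        S_i = [(x, 1 if y == i else -1) for x, y in S]  # class i versus rest
        h[i] = A(S_i)
    # max() returns the first maximizer, so ties go to the smallest label.
    return lambda x: max(sorted(labels), key=lambda i: h[i](x))


S = [((0.0, 4.0), 1), ((0.0, 5.0), 1),
     ((4.0, 0.0), 2), ((5.0, 0.0), 2),
     ((-4.0, -4.0), 3), ((-5.0, -5.0), 3)]
h = one_versus_all(S, [1, 2, 3], centroid_learner)
train_preds = [h(x) for x, _ in S]
```

On this well-separated toy set every training point is assigned its own label; Example 17.1 below shows that this need not happen in general.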


Another popular reduction is the All-Pairs approach, in which all pairs of classes
are compared to each other. Formally, given a training set $S = (x_1, y_1), \ldots, (x_m, y_m)$,
where every $y_i$ is in $[k]$, for every $1 \le i < j \le k$ we construct a binary training
sequence, $S_{i,j}$, containing all examples from $S$ whose label is either $i$ or $j$. For each
such an example, we set the binary label in $S_{i,j}$ to be $+1$ if the multiclass label in
$S$ is $i$ and $-1$ if the multiclass label in $S$ is $j$. Next, we train a binary classification
algorithm based on every $S_{i,j}$ to get $h_{i,j}$. Finally, we construct a multiclass classifier
by predicting the class that had the highest number of "wins." A pseudocode of the
All-Pairs approach is given in the following.
All-Pairs


input:
    training set $S = (x_1, y_1), \ldots, (x_m, y_m)$
    algorithm for binary classification $A$
foreach $i, j \in \mathcal{Y}$ s.t. $i < j$
    initialize $S_{i,j}$ to be the empty sequence
    for $t = 1, \ldots, m$
        If $y_t = i$ add $(x_t, 1)$ to $S_{i,j}$
        If $y_t = j$ add $(x_t, -1)$ to $S_{i,j}$
    let $h_{i,j} = A(S_{i,j})$
output:
    the multiclass hypothesis defined by
    $h(x) \in \operatorname{argmax}_{i \in \mathcal{Y}} \left( \sum_{j \in \mathcal{Y}} \operatorname{sign}(j - i)\, h_{i,j}(x) \right)$
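The following sketch mirrors the All-Pairs pseudocode. Two points are our own reading rather than something the text states: for $i > j$ the symbol $h_{i,j}$ is interpreted as $h_{j,i}$, and a real-valued score is turned into a vote by its sign (a score of exactly zero counts as a loss). `centroid_learner` is again a hypothetical stand-in for $A$.

```python
from itertools import combinations


def centroid_learner(S_bin):
    """Illustrative binary learner A: positive score when x is closer to the
    mean of the +1 examples than to the mean of the -1 examples."""
    pos = [x for x, y in S_bin if y == 1]
    neg = [x for x, y in S_bin if y == -1]

    def mean(pts):
        return [sum(c) / len(pts) for c in zip(*pts)]

    c_pos, c_neg = mean(pos), mean(neg)
    w = [p - n for p, n in zip(c_pos, c_neg)]
    b = 0.5 * (sum(n * n for n in c_neg) - sum(p * p for p in c_pos))
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b


def all_pairs(S, labels, A):
    labels = sorted(labels)
    h = {}
    for i, j in combinations(labels, 2):  # every pair with i < j
        S_ij = [(x, 1 if y == i else -1) for x, y in S if y in (i, j)]
        h[i, j] = A(S_ij)

    def wins(i, x):
        """Sum over j of sign(j - i) * h_{i,j}(x), counted as +1/-1 votes."""
        total = 0
        for j in labels:
            if j == i:
                continue
            value = h[i, j](x) if i < j else -h[j, i](x)
            total += 1 if value > 0 else -1
        return total

    return lambda x: max(labels, key=lambda i: wins(i, x))


S = [((0.0, 4.0), 1), ((0.0, 5.0), 1),
     ((4.0, 0.0), 2), ((5.0, 0.0), 2),
     ((-4.0, -4.0), 3), ((-5.0, -5.0), 3)]
h = all_pairs(S, [1, 2, 3], centroid_learner)
train_preds = [h(x) for x, _ in S]
```

Note the cost trade-off: All-Pairs trains $k(k-1)/2$ binary predictors, each on a subset of the data, whereas One-versus-All trains $k$ predictors on the full training set.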


Although reduction methods such as the One-versus-All and All-Pairs are sim- 
ple and easy to construct from existing algorithms, their simplicity has a price. The 
binary learner is not aware of the fact that we are going to use its output hypotheses 
for constructing a multiclass predictor, and this might lead to suboptimal results, as 
illustrated in the following example. 


Example 17.1. Consider a multiclass categorization problem in which the instance
space is $\mathcal{X} = \mathbb{R}^2$ and the label set is $\mathcal{Y} = \{1, 2, 3\}$. Suppose that instances of the
different classes are located in nonintersecting balls as depicted in the following.

[Figure: instances of the three classes located in nonintersecting balls.]


Suppose that the probability masses of classes 1, 2, 3 are 40%, 20%, and 40%,
respectively. Consider the application of One-versus-All to this problem, and
assume that the binary classification algorithm used by One-versus-All is ERM with
respect to the hypothesis class of halfspaces. Observe that for the problem of dis-
criminating between class 2 and the rest of the classes, the optimal halfspace would
be the all negative classifier. Therefore, the multiclass predictor constructed by One-
versus-All might err on all the examples from class 2 (this will be the case if the tie in
the definition of $h(x)$ is broken by the numerical value of the class label). In contrast,
if we choose $h_i(x) = \langle w_i, x \rangle$, where $w_1 = \left( -\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}} \right)$, $w_2 = (0, 1)$, and $w_3 = \left( \frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}} \right)$,
then the classifier defined by $h(x) = \operatorname{argmax}_i h_i(x)$ perfectly predicts all the exam-
ples. We see that even though the approximation error of the class of predictors of
the form $h(x) = \operatorname{argmax}_i \langle w_i, x \rangle$ is zero, the One-versus-All approach might fail to
find a good predictor from this class.
17.2 LINEAR MULTICLASS PREDICTORS


In light of the inadequacy of reduction methods, in this section we study a more 
direct approach for learning multiclass predictors. We describe the family of linear 
multiclass predictors. To motivate the construction of this family, recall that a linear 
predictor for binary classification (i.e., a halfspace) takes the form

\[
h(x) = \operatorname{sign}(\langle w, x \rangle).
\]

An equivalent way to express the prediction is as follows:

\[
h(x) = \operatorname*{argmax}_{y \in \{\pm 1\}} \langle w, y x \rangle,
\]

where $yx$ is the vector obtained by multiplying each element of $x$ by $y$.

This representation leads to a natural generalization of halfspaces to multiclass
problems as follows. Let $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ be a class-sensitive feature mapping. That
is, $\Psi$ takes as input a pair $(x, y)$ and maps it into a $d$ dimensional feature vector.
Intuitively, we can think of the elements of $\Psi(x, y)$ as score functions that assess
how well the label $y$ fits the instance $x$. We will elaborate on $\Psi$ later on. Given $\Psi$
and a vector $w \in \mathbb{R}^d$, we can define a multiclass predictor, $h : \mathcal{X} \to \mathcal{Y}$, as follows:

\[
h(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} \langle w, \Psi(x, y) \rangle .
\]

That is, the prediction of $h$ for the input $x$ is the label that achieves the highest
weighted score, where weighting is according to the vector $w$.

Let $W$ be some set of vectors in $\mathbb{R}^d$, for example, $W = \{w \in \mathbb{R}^d : \|w\| \le B\}$,
for some scalar $B > 0$. Each pair $(\Psi, W)$ defines a hypothesis class of multiclass
predictors:

\[
\mathcal{H}_{\Psi, W} = \{ x \mapsto \operatorname*{argmax}_{y \in \mathcal{Y}} \langle w, \Psi(x, y) \rangle : w \in W \}.
\]

Of course, the immediate question, which we discuss in the sequel, is how to con-
struct a good $\Psi$. Note that if $\mathcal{Y} = \{\pm 1\}$ and we set $\Psi(x, y) = yx$ and $W = \mathbb{R}^d$, then
$\mathcal{H}_{\Psi, W}$ becomes the hypothesis class of homogeneous halfspace predictors for binary
classification.


17.2.1 How to Construct $\Psi$


As mentioned before, we can think of the elements of $\Psi(x, y)$ as score functions
that assess how well the label $y$ fits the instance $x$. Naturally, designing a good $\Psi$
is similar to the problem of designing a good feature mapping (as we discussed in
Chapter 16 and as we will discuss in more detail in Chapter 25). Two examples of
useful constructions are given in the following.


The Multivector Construction:
Let $\mathcal{Y} = \{1, \ldots, k\}$ and let $\mathcal{X} = \mathbb{R}^n$. We define $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$, where $d = nk$, as
follows

\[
\Psi(x, y) = [\, \underbrace{0, \ldots, 0}_{\in \mathbb{R}^{(y-1)n}},\ \underbrace{x_1, \ldots, x_n}_{\in \mathbb{R}^{n}},\ \underbrace{0, \ldots, 0}_{\in \mathbb{R}^{(k-y)n}} \,]. \tag{17.2}
\]

That is, $\Psi(x, y)$ is composed of $k$ vectors, each of which is of dimension $n$, where
we set all the vectors to be the all zeros vector except the $y$'th vector, which is set
to be $x$. It follows that we can think of $w \in \mathbb{R}^{nk}$ as being composed of $k$ weight
vectors in $\mathbb{R}^n$, that is, $w = [w_1; \ldots; w_k]$, hence the name multivector construction.
By the construction we have that $\langle w, \Psi(x, y) \rangle = \langle w_y, x \rangle$, and therefore the multiclass
prediction becomes

\[
h(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} \langle w_y, x \rangle .
\]
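As a concrete illustration of Equation (17.2), the sketch below builds $\Psi(x, y)$ for labels $1, \ldots, k$ and checks that $\langle w, \Psi(x, y) \rangle = \langle w_y, x \rangle$. The function names and the particular weight vectors are made up for the example.

```python
def psi_multivector(x, y, k):
    """Psi(x, y) in R^{nk}: x sits in the y-th block (labels 1..k), and all
    other blocks are zero."""
    n = len(x)
    out = [0.0] * (n * k)
    out[(y - 1) * n: y * n] = x
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def predict(w, x, k):
    """h(x) = argmax over y of <w, Psi(x, y)>."""
    return max(range(1, k + 1), key=lambda y: dot(w, psi_multivector(x, y, k)))


w_blocks = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]  # w_1, w_2, w_3 (illustrative)
w = [c for block in w_blocks for c in block]        # w = [w_1; w_2; w_3]
x = (0.3, 2.0)

scores_full = [dot(w, psi_multivector(x, y, 3)) for y in (1, 2, 3)]
scores_block = [dot(w_blocks[y - 1], x) for y in (1, 2, 3)]
label = predict(w, x, 3)
```

In practice one would store the $k$ blocks directly and never materialize the mostly-zero vector $\Psi(x, y)$; the long form is useful mainly for the analysis.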

A geometric illustration of the multiclass prediction over $\mathcal{X} = \mathbb{R}^2$ is given in the
following.


TF-IDF:

The previous definition of $\Psi(x, y)$ does not incorporate any prior knowledge about
the problem. We next describe an example of a feature function $\Psi$ that does incor-
porate prior knowledge. Let $\mathcal{X}$ be a set of text documents and $\mathcal{Y}$ be a set of possible
topics. Let $d$ be the size of a dictionary of words. For each word in the dictionary,
whose corresponding index is $j$, let $TF(j, x)$ be the number of times the word cor-
responding to $j$ appears in the document $x$. This quantity is called Term-Frequency.
Additionally, let $DF(j, y)$ be the number of times the word corresponding to $j$
appears in documents in our training set that are not about topic $y$. This quantity
is called Document-Frequency and measures whether word $j$ is frequent in other
topics. Now, define $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ to be such that

\[
\Psi_j(x, y) = TF(j, x) \log\left( \frac{m}{DF(j, y)} \right),
\]

where $m$ is the total number of documents in our training set. The preceding quantity
is called term-frequency-inverse-document-frequency or TF-IDF for short. Intu-
itively, $\Psi_j(x, y)$ should be large if the word corresponding to $j$ appears a lot in the
document $x$ but does not appear at all in documents that are not on topic $y$. If this
is the case, we tend to believe that the document $x$ is on topic $y$. Note that unlike
the multivector construction described previously, in the current construction the
dimension of $\Psi$ does not depend on the number of topics (i.e., the size of $\mathcal{Y}$).
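A small sketch of this feature map follows. One detail is an assumption of ours and not part of the definition above: the formula is undefined when $DF(j, y) = 0$, so the code adds a hypothetical `smoothing` constant (default 1) to the document frequency. Documents are represented as lists of tokens, and all names are illustrative.

```python
from math import log


def tf(word, doc):
    """Term frequency: number of occurrences of `word` in the document."""
    return doc.count(word)


def df(word, topic, training_set):
    """Occurrences of `word` in training documents that are NOT about `topic`."""
    return sum(doc.count(word) for doc, y in training_set if y != topic)


def psi_tfidf(doc, topic, dictionary, training_set, smoothing=1.0):
    """Psi_j(x, y) = TF(j, x) * log(m / (DF(j, y) + smoothing)).
    The smoothing term is our addition to avoid division by zero."""
    m = len(training_set)
    return [tf(wd, doc) * log(m / (df(wd, topic, training_set) + smoothing))
            for wd in dictionary]


training_set = [(["goal", "match", "goal"], "sports"),
                (["vote", "election"], "politics"),
                (["match", "vote"], "politics")]
dictionary = ["goal", "match", "vote"]
doc = ["goal", "goal", "match"]

psi_sports = psi_tfidf(doc, "sports", dictionary, training_set)
psi_politics = psi_tfidf(doc, "politics", dictionary, training_set)
```

The word "goal" never occurs outside the sports documents, so its coordinate is large in `psi_sports` and vanishes in `psi_politics`, which is the behavior described in the text. The length of both vectors equals the dictionary size regardless of the number of topics.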


17.2.2 Cost-Sensitive Classification 


So far we used the zero-one loss as our performance measure of the quality of $h(x)$.
That is, the loss of a hypothesis $h$ on an example $(x, y)$ is 1 if $h(x) \neq y$ and 0 other-
wise. In some situations it makes more sense to penalize different levels of loss for
different mistakes. For example, in object recognition tasks, it is less severe to pre-
dict that an image of a tiger contains a cat than predicting that the image contains a

Since $h_w(x) \in \mathcal{Y}$ we can upper bound the right-hand side of the preceding by

\[
\max_{y' \in \mathcal{Y}} \left( \Delta(y', y) + \langle w, \Psi(x, y') - \Psi(x, y) \rangle \right) \stackrel{\text{def}}{=} \ell(w, (x, y)). \tag{17.3}
\]


We use the term "generalized hinge loss" to denote the preceding expression. As we
have shown, $\ell(w, (x, y)) \ge \Delta(h_w(x), y)$. Furthermore, equality holds whenever the
score of the correct label is larger than the score of any other label, $y'$, by at least
$\Delta(y', y)$, namely,

\[
\forall y' \in \mathcal{Y} \setminus \{y\}, \quad \langle w, \Psi(x, y) \rangle \ge \langle w, \Psi(x, y') \rangle + \Delta(y', y).
\]

It is also immediate to see that $\ell(w, (x, y))$ is a convex function with respect to $w$
since it is a maximum over linear functions of $w$ (see Claim 12.5 in Chapter 12), and
that $\ell(w, (x, y))$ is $\rho$-Lipschitz with $\rho = \max_{y' \in \mathcal{Y}} \|\Psi(x, y') - \Psi(x, y)\|$.


Remark 17.2. We use the name "generalized hinge loss" since in the binary case,
when $\mathcal{Y} = \{\pm 1\}$, if we set $\Psi(x, y) = \frac{yx}{2}$, then the generalized hinge loss becomes the
vanilla hinge loss for binary classification,

\[
\ell(w, (x, y)) = \max\{0,\, 1 - y \langle w, x \rangle\}.
\]
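Remark 17.2 is easy to verify numerically. The sketch below implements Equation (17.3) directly and compares it with the binary hinge loss under $\Psi(x, y) = yx/2$ and the zero-one $\Delta$; function names and the test points are our own.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def generalized_hinge(w, x, y, labels, psi, delta):
    """max over y' of delta(y', y) + <w, psi(x, y') - psi(x, y)>."""
    base = dot(w, psi(x, y))
    return max(delta(yp, y) + dot(w, psi(x, yp)) - base for yp in labels)


def zero_one(yp, y):
    return 0.0 if yp == y else 1.0


def psi_binary(x, y):
    """Psi(x, y) = y * x / 2, as in Remark 17.2."""
    return [y * xi / 2.0 for xi in x]


def hinge(w, x, y):
    return max(0.0, 1.0 - y * dot(w, x))


cases = [((0.5, -1.0), (2.0, 1.0), 1),
         ((0.5, -1.0), (2.0, 1.0), -1),
         ((2.0, 0.0), (1.5, 3.0), 1),
         ((2.0, 0.0), (1.5, 3.0), -1),
         ((-0.3, 0.8), (0.4, -0.2), 1)]
pairs = [(generalized_hinge(w, x, y, [1, -1], psi_binary, zero_one),
          hinge(w, x, y)) for w, x, y in cases]
```

Taking $y' = y$ in the maximum contributes 0 and taking $y' = -y$ contributes $1 - y \langle w, x \rangle$, which is exactly the two-term maximum of the hinge loss.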


Geometric Intuition:

The feature function $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ maps each $x$ into $|\mathcal{Y}|$ vectors in $\mathbb{R}^d$. The value of
$\ell(w, (x, y))$ will be zero if there exists a direction $w$ such that when projecting the $|\mathcal{Y}|$
vectors onto this direction we obtain that each vector is represented by the scalar
$\langle w, \Psi(x, y) \rangle$, and we can rank the different points on the basis of these scalars so
that


■ The point corresponding to the correct $y$ is top-ranked

■ For each $y' \neq y$, the difference between $\langle w, \Psi(x, y) \rangle$ and $\langle w, \Psi(x, y') \rangle$ is larger
than the loss of predicting $y'$ instead of $y$. The difference $\langle w, \Psi(x, y) \rangle -
\langle w, \Psi(x, y') \rangle$ is also referred to as the "margin" (see Section 15.1).


This is illustrated in the figure following:


17.2.5 Multiclass SVM and SGD 


Once we have defined the generalized hinge loss, we obtain a convex-Lipschitz
learning problem and we can apply our general techniques for solving such
problems. In particular, the RLM technique we have studied in Chapter 13 yields
the multiclass SVM rule:


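Multiclass SVM


input: $(x_1, y_1), \ldots, (x_m, y_m)$
parameters:
    regularization parameter $\lambda > 0$
    loss function $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$
    class-sensitive feature mapping $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$
solve:

\[
\min_{w \in \mathbb{R}^d} \left( \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^m \max_{y' \in \mathcal{Y}} \left( \Delta(y', y_i) + \langle w, \Psi(x_i, y') - \Psi(x_i, y_i) \rangle \right) \right)
\]

output the predictor $h_w(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \langle w, \Psi(x, y) \rangle$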


We can solve the optimization problem associated with multiclass SVM 
using generic convex optimization algorithms (or using the method described in 
Section 15.5). Let us analyze the risk of the resulting hypothesis. The analysis 
seamlessly follows from our general analysis for convex-Lipschitz problems given 
in Chapter 13. In particular, applying Corollary 13.8 and using the fact that the gen-
eralized hinge loss upper bounds the $\Delta$ loss, we immediately obtain an analog of
Corollary 15.7:


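Corollary 17.1. Let $\mathcal{D}$ be a distribution over $\mathcal{X} \times \mathcal{Y}$, let $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$, and assume
that for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ we have $\|\Psi(x, y)\| \le \rho/2$. Let $B > 0$. Consider running
Multiclass SVM with $\lambda = \sqrt{\frac{2\rho^2}{B^2 m}}$ on a training set $S \sim \mathcal{D}^m$ and let $h_w$ be the output of
Multiclass SVM. Then,

\[
\mathop{\mathbb{E}}_{S \sim \mathcal{D}^m} [L_{\mathcal{D}}^{\Delta}(h_w)] \le \mathop{\mathbb{E}}_{S \sim \mathcal{D}^m} [L_{\mathcal{D}}^{\text{g-hinge}}(w)] \le \min_{u : \|u\| \le B} L_{\mathcal{D}}^{\text{g-hinge}}(u) + \sqrt{\frac{8 \rho^2 B^2}{m}},
\]

where $L_{\mathcal{D}}^{\Delta}(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}} [\Delta(h(x), y)]$ and $L_{\mathcal{D}}^{\text{g-hinge}}(w) = \mathbb{E}_{(x,y) \sim \mathcal{D}} [\ell(w, (x, y))]$ with $\ell$
being the generalized hinge-loss as defined in Equation (17.3).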


We can also apply the SGD learning framework for minimizing $L_{\mathcal{D}}^{\text{g-hinge}}(w)$
as described in Chapter 14. Recall Claim 14.6, which dealt with subgradients
of max functions. In light of this claim, in order to find a subgradient of the
generalized hinge loss all we need to do is to find $y \in \mathcal{Y}$ that achieves the max-
imum in the definition of the generalized hinge loss. This yields the following
algorithm:


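SGD for Multiclass Learning


parameters:
    Scalar $\eta > 0$, integer $T > 0$
    loss function $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$
    class-sensitive feature mapping $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$
initialize: $w^{(1)} = 0 \in \mathbb{R}^d$
for $t = 1, 2, \ldots, T$
    sample $(x, y) \sim \mathcal{D}$
    find $\hat{y} \in \operatorname{argmax}_{y' \in \mathcal{Y}} \left( \Delta(y', y) + \langle w^{(t)}, \Psi(x, y') - \Psi(x, y) \rangle \right)$
    set $v_t = \Psi(x, \hat{y}) - \Psi(x, y)$
    update $w^{(t+1)} = w^{(t)} - \eta v_t$
output $\bar{w} = \frac{1}{T} \sum_{t=1}^T w^{(t)}$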
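The algorithm can be sketched in a few lines. The assumptions here are ours: sampling from $\mathcal{D}$ is replaced by a user-supplied `sample` callable (in the example, a uniform draw from a small training list), ties in the argmax go to the first label in `labels`, and the multivector construction is used for $\Psi$. All names are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def psi_multivector(x, y, k):
    """Multivector construction: x in the y-th of k blocks (labels 1..k)."""
    n = len(x)
    out = [0.0] * (n * k)
    out[(y - 1) * n: y * n] = x
    return out


def zero_one(yp, y):
    return 0.0 if yp == y else 1.0


def sgd_multiclass(sample, labels, psi, delta, d, eta, T, seed=0):
    """Return w_bar, the average of w^(1), ..., w^(T)."""
    rng = random.Random(seed)
    w = [0.0] * d
    w_sum = [0.0] * d
    for _ in range(T):
        w_sum = [s + wi for s, wi in zip(w_sum, w)]  # add w^(t) before updating
        x, y = sample(rng)
        psi_y = psi(x, y)
        # y_hat attains the max in the generalized hinge loss at w^(t).
        y_hat = max(labels, key=lambda yp: delta(yp, y)
                    + dot(w, psi(x, yp)) - dot(w, psi_y))
        v = [a - b for a, b in zip(psi(x, y_hat), psi_y)]  # subgradient v_t
        w = [wi - eta * vi for wi, vi in zip(w, v)]
    return [s / T for s in w_sum]


train = [((1.0, 0.0), 1), ((0.0, 1.0), 2), ((-1.0, -1.0), 3)]
w_bar = sgd_multiclass(lambda rng: rng.choice(train), [1, 2, 3],
                       lambda x, y: psi_multivector(x, y, 3),
                       zero_one, d=6, eta=0.5, T=100)
```

When $\hat{y} = y$ the vector $v_t$ is zero and the iterate is left unchanged, so updates occur only on examples whose margin requirement is violated.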


Our general analysis of SGD given in Corollary 14.12 immediately implies: 


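Corollary 17.2. Let $\mathcal{D}$ be a distribution over $\mathcal{X} \times \mathcal{Y}$, let $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$, and assume
that for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ we have $\|\Psi(x, y)\| \le \rho/2$. Let $B > 0$. Then, for every
$\epsilon > 0$, if we run SGD for multiclass learning with a number of iterations (i.e., number
of examples)

\[
T \ge \frac{B^2 \rho^2}{\epsilon^2}
\]

and with $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$, then the output of SGD satisfies

\[
\mathop{\mathbb{E}}_{S \sim \mathcal{D}^m} [L_{\mathcal{D}}^{\Delta}(h_{\bar{w}})] \le \mathop{\mathbb{E}}_{S \sim \mathcal{D}^m} [L_{\mathcal{D}}^{\text{g-hinge}}(\bar{w})] \le \min_{u : \|u\| \le B} L_{\mathcal{D}}^{\text{g-hinge}}(u) + \epsilon .
\]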


Remark 17.3. It is interesting to note that the risk bounds given in Corollary 17.1 and
Corollary 17.2 do not depend explicitly on the size of the label set $\mathcal{Y}$, a fact we will
rely on in the next section. However, the bounds may depend implicitly on the size
of $\mathcal{Y}$ via the norm of $\Psi(x, y)$ and the fact that the bounds are meaningful only when
there exists some vector $u$, $\|u\| \le B$, for which $L_{\mathcal{D}}^{\text{g-hinge}}(u)$ is not excessively large.


17.3 STRUCTURED OUTPUT PREDICTION 


Structured output prediction problems are multiclass problems in which $\mathcal{Y}$ is very
large but is endowed with a predefined structure. The structure plays a key role in
constructing efficient algorithms. To motivate structured learning problems, con-
sider the problem of optical character recognition (OCR). Suppose we receive an
image of some handwritten word and would like to predict which word is written in
the image. To simplify the setting, suppose we know how to segment the image into
a sequence of images, each of which contains a patch of the image corresponding
to a single letter. Therefore, $\mathcal{X}$ is the set of sequences of images and $\mathcal{Y}$ is the set
of sequences of letters. Note that the size of $\mathcal{Y}$ grows exponentially with the max-
imal length of a word. An example of an image $x$ corresponding to the label $y$ =
"workable" is given in the following.
To tackle structure prediction we can rely on the family of linear predictors
described in the previous section. In particular, we need to define a reasonable loss
function for the problem, $\Delta$, as well as a good class-sensitive feature mapping, $\Psi$.
By "good" we mean a feature mapping that will lead to a low approximation error
for the class of linear predictors with respect to $\Psi$ and $\Delta$. Once we do this, we can
rely, for example, on the SGD learning algorithm defined in the previous section.

However, the huge size of $\mathcal{Y}$ poses several challenges:


1. To apply the multiclass prediction we need to solve a maximization problem
over $\mathcal{Y}$. How can we predict efficiently when $\mathcal{Y}$ is so large?

2. How do we train $w$ efficiently? In particular, to apply the SGD rule we again
need to solve a maximization problem over $\mathcal{Y}$.

3. How can we avoid overfitting?


In the previous section we have already shown that the sample complexity of
learning a linear multiclass predictor does not depend explicitly on the number of
classes. We just need to make sure that the norm of the range of $\Psi$ is not too large.
This will take care of the overfitting problem. To tackle the computational chal-
lenges we rely on the structure of the problem, and define the functions $\Psi$ and $\Delta$ so
that calculating the maximization problems in the definition of $h_w$ and in the SGD
algorithm can be performed efficiently. In the following we demonstrate one way to
achieve these goals for the OCR task mentioned previously.

To simplify the presentation, let us assume that all the words in $\mathcal{Y}$ are of length
$r$ and that the number of different letters in our alphabet is $q$. Let $y$ and $y'$ be two
words (i.e., sequences of letters) in $\mathcal{Y}$. We define the function $\Delta(y', y)$ to be the
average number of letters that are different in $y'$ and $y$, namely, $\frac{1}{r} \sum_{i=1}^r \mathbb{1}_{[y_i \neq y'_i]}$.

Next, let us define a class-sensitive feature mapping $\Psi(x, y)$. It will be convenient
to think about $x$ as a matrix of size $n \times r$, where $n$ is the number of pixels in each
image, and $r$ is the number of images in the sequence. The $j$'th column of $x$ corre-
sponds to the $j$'th image in the sequence (encoded as a vector of gray level values
of pixels). The dimension of the range of $\Psi$ is set to be $d = nq + q^2$.

The first $nq$ feature functions are "type 1" features and take the form:

\[
\Psi_{i,j,1}(x, y) = \frac{1}{r} \sum_{t=1}^r x_{i,t}\, \mathbb{1}_{[y_t = j]} .
\]


That is, we sum the value of the i’th pixel only over the images for which y assigns 
the letter j. The triple index (i, j, 1) indicates that we are dealing with feature (i, j) 
of type 1. Intuitively, such features can capture pixels in the image whose gray level 
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values are indicative of a certain letter. The second type of features take the form 


. 
Wij, 20%Y) = : Mpa Mp a=a 
t=2 
That is, we sum the number of times the letter i follows the letter j. Intuitively, 
these features can capture rules like “It is likely to see the pair ‘qu’ in a word” or “It 
is unlikely to see the pair ‘rz’ in a word.” Of course, some of these features will not 
be very useful, so the goal of the learning process is to assign weights to features by 
learning the vector w, so that the weighted score will give us a good prediction via 


\[
h_{\mathbf{w}}(\mathbf{x}) = \operatorname*{argmax}_{\mathbf{y}\in\mathcal{Y}} \langle \mathbf{w}, \Psi(\mathbf{x},\mathbf{y})\rangle.
\]


It is left to show how to solve the optimization problem in the definition of $h_{\mathbf{w}}(\mathbf{x})$ 
efficiently, as well as how to solve the optimization problem in the definition of $\hat{\mathbf{y}}$ in 
the SGD algorithm. We can do this by applying a dynamic programming procedure. 
We describe the procedure for solving the maximization in the definition of $h_{\mathbf{w}}$ and 
leave as an exercise the maximization problem in the definition of $\hat{\mathbf{y}}$ in the SGD 
algorithm. 

To derive the dynamic programming procedure, let us first observe that we can 
write 


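\[
\Psi(\mathbf{x},\mathbf{y}) = \sum_{t=1}^{r} \phi(\mathbf{x}, y_t, y_{t-1}),
\]

for an appropriate $\phi : \mathcal{X} \times [q] \times ([q] \cup \{0\}) \to \mathbb{R}^d$, and for simplicity we assume that $y_0$ 
is always equal to 0. Indeed, each feature function $\Psi_{i,j,1}$ can be written in terms of 

\[
\phi_{i,j,1}(\mathbf{x}, y_t, y_{t-1}) = x_{i,t}\,\mathbb{1}_{[y_t = j]},
\]

while the feature function $\Psi_{i,j,2}$ can be written in terms of 

\[
\phi_{i,j,2}(\mathbf{x}, y_t, y_{t-1}) = \mathbb{1}_{[y_t = i]}\,\mathbb{1}_{[y_{t-1} = j]}.
\]

(The constant normalization factor $1/r$ can be folded into $\phi$; it does not change the maximizer.) 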


Therefore, the prediction can be written as 
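\[
h_{\mathbf{w}}(\mathbf{x}) = \operatorname*{argmax}_{\mathbf{y}\in\mathcal{Y}} \sum_{t=1}^{r} \langle \mathbf{w}, \phi(\mathbf{x}, y_t, y_{t-1})\rangle. \tag{17.4}
\]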


In the following we derive a dynamic programming procedure that solves every 
problem of the form given in Equation (17.4). The procedure will maintain a matrix 
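$M \in \mathbb{R}^{q,r}$ such that 

\[
M_{s,\tau} = \max_{(y_1,\ldots,y_\tau)\,:\,y_\tau = s}\ \sum_{t=1}^{\tau} \langle \mathbf{w}, \phi(\mathbf{x}, y_t, y_{t-1})\rangle.
\]

Clearly, the maximum of $\langle \mathbf{w}, \Psi(\mathbf{x},\mathbf{y})\rangle$ equals $\max_s M_{s,r}$. Furthermore, we can 
calculate $M$ in a recursive manner: 

\[
M_{s,\tau} = \max_{s'} \left( M_{s',\tau-1} + \langle \mathbf{w}, \phi(\mathbf{x}, s, s')\rangle \right). \tag{17.5}
\]

This yields the following procedure: 


Dynamic Programming for Calculating $h_{\mathbf{w}}(\mathbf{x})$ as Given in 
Equation (17.4) 


input: a matrix $\mathbf{x} \in \mathbb{R}^{n,r}$ and a vector $\mathbf{w}$ 
initialize: 
  foreach $s \in [q]$ 
    $M_{s,1} = \langle \mathbf{w}, \phi(\mathbf{x}, s, 0)\rangle$ 
for $\tau = 2,\ldots,r$ 
  foreach $s \in [q]$ 
    set $M_{s,\tau}$ as in Equation (17.5) 
    set $I_{s,\tau}$ to be the $s'$ that maximizes Equation (17.5) 
set $y_r = \operatorname{argmax}_s M_{s,r}$ 
for $\tau = r, r-1, \ldots, 2$ 
  set $y_{\tau-1} = I_{y_\tau,\tau}$ 
output: $\mathbf{y} = (y_1,\ldots,y_r)$ 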
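
The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the names `dp_argmax`, `total_score`, and `brute_force_value` are hypothetical, letters are encoded as 1..q with 0 reserved for the start symbol $y_0$, and the caller supplies `score(t, s, s_prev)`, which stands for $\langle \mathbf{w}, \phi(\mathbf{x}, s, s')\rangle$ evaluated at position $t$. The toy scores are random numbers, not learned weights.

```python
import random
from itertools import product


def dp_argmax(q, r, score):
    """Maximize sum_{t=1..r} score(t, y_t, y_{t-1}) over y in [q]^r, with y_0 = 0."""
    letters = range(1, q + 1)
    # M[t][s]: best score of a prefix y_1..y_t that ends with y_t = s.
    # I[t][s]: the previous letter s' that attains M[t][s].
    M = [dict() for _ in range(r + 1)]
    I = [dict() for _ in range(r + 1)]
    for s in letters:
        M[1][s] = score(1, s, 0)
    for t in range(2, r + 1):
        for s in letters:
            prev = max(letters, key=lambda sp: M[t - 1][sp] + score(t, s, sp))
            I[t][s] = prev
            M[t][s] = M[t - 1][prev] + score(t, s, prev)
    # Backtrack from the best final letter.
    y = [0] * (r + 1)
    y[r] = max(letters, key=lambda s: M[r][s])
    for t in range(r, 1, -1):
        y[t - 1] = I[t][y[t]]
    return y[1:], M[r][y[r]]


def total_score(y, score):
    """Score of a full word y (sequence of letters), using y_0 = 0."""
    padded = (0,) + tuple(y)
    return sum(score(t, padded[t], padded[t - 1]) for t in range(1, len(padded)))


def brute_force_value(q, r, score):
    """Exhaustive maximum over all q**r words; only feasible for tiny q and r."""
    return max(total_score(y, score) for y in product(range(1, q + 1), repeat=r))


# Toy check with arbitrary local scores (assumed data, for illustration only).
rng = random.Random(0)
q, r = 3, 4
table = {(t, s, sp): rng.uniform(-1, 1)
         for t in range(1, r + 1) for s in range(1, q + 1) for sp in range(q + 1)}


def score(t, s, sp):
    return table[(t, s, sp)]


y_best, value = dp_argmax(q, r, score)
print(y_best, round(value, 4))
```

The dynamic program evaluates on the order of $r q^2$ local scores, whereas the exhaustive search enumerates all $q^r$ words; the two agree on the maximal value.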


17.4 RANKING 


Ranking is the problem of ordering a set of instances according to their “rele- 
vance.” A typical application is ordering results of a search engine according to their 
relevance to the query. Another example is a system that monitors electronic trans- 
actions and should alert for possible fraudulent transactions. Such a system should 
order transactions according to how suspicious they are. 

Formally, let $\mathcal{X}^* = \bigcup_{n=1}^{\infty} \mathcal{X}^n$ be the set of all sequences of instances from $\mathcal{X}$ of 
arbitrary length. A ranking hypothesis, $h$, is a function that receives a sequence of 
instances $\bar{\mathbf{x}} = (\mathbf{x}_1,\ldots,\mathbf{x}_r) \in \mathcal{X}^*$, and returns a permutation of $[r]$. It is more convenient 
to let the output of $h$ be a vector $\mathbf{y} \in \mathbb{R}^r$, where by sorting the elements of $\mathbf{y}$ 
we obtain the permutation over $[r]$. We denote by $\pi(\mathbf{y})$ the permutation over $[r]$ 
induced by $\mathbf{y}$. For example, for $r = 5$, the vector $\mathbf{y} = (2, 1, 6, -1, 0.5)$ induces the 
permutation $\pi(\mathbf{y}) = (4, 3, 5, 1, 2)$. That is, if we sort $\mathbf{y}$ in an ascending order, then 
we obtain the vector $(-1, 0.5, 1, 2, 6)$. Now, $\pi(\mathbf{y})_i$ is the position of $y_i$ in the sorted 
vector $(-1, 0.5, 1, 2, 6)$. This notation reflects that the top-ranked instances are 
those that achieve the highest values in $\pi(\mathbf{y})$. 
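
As a small illustration of the induced permutation, the following Python sketch computes $\pi(\mathbf{y})$ for the example vector above. The function name `induced_permutation` is hypothetical, and breaking ties by index is an assumption that the text does not specify.

```python
def induced_permutation(y):
    """Return pi(y): entry i is the 1-based position of y[i] when y is sorted ascending."""
    # Indices of y in ascending order of value (ties broken by index: an assumption).
    order = sorted(range(len(y)), key=lambda i: y[i])
    pi = [0] * len(y)
    for position, i in enumerate(order, start=1):
        pi[i] = position
    return pi


print(induced_permutation([2, 1, 6, -1, 0.5]))  # [4, 3, 5, 1, 2]
```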

In the notation of our PAC learning model, the examples domain is $Z = \bigcup_{r=1}^{\infty} (\mathcal{X}^r \times \mathbb{R}^r)$, 
and the hypothesis class, $\mathcal{H}$, is some set of ranking hypotheses. We 
next turn to describe loss functions for ranking. There are many possible ways to 
define such loss functions, and here we list a few examples. In all the examples we 
define $\ell(h, (\bar{\mathbf{x}}, \mathbf{y})) = \Delta(h(\bar{\mathbf{x}}), \mathbf{y})$, for some function $\Delta : \bigcup_{r=1}^{\infty} (\mathbb{R}^r \times \mathbb{R}^r) \to \mathbb{R}_+$. 


• 0-1 Ranking loss: $\Delta(\mathbf{y}',\mathbf{y})$ is zero if $\mathbf{y}$ and $\mathbf{y}'$ induce exactly the same ranking 
and $\Delta(\mathbf{y}',\mathbf{y}) = 1$ otherwise. That is, $\Delta(\mathbf{y}',\mathbf{y}) = \mathbb{1}_{[\pi(\mathbf{y}') \neq \pi(\mathbf{y})]}$. Such a loss function is 
almost never used in practice as it does not distinguish between the case in which 
$\pi(\mathbf{y}')$ is almost equal to $\pi(\mathbf{y})$ and the case in which $\pi(\mathbf{y}')$ is completely different 
from $\pi(\mathbf{y})$. 

• Kendall-Tau Loss: We count the number of pairs $(i, j)$ that are in different order 
in the two permutations. This can be written as 

\[
\Delta(\mathbf{y}',\mathbf{y}) = \frac{2}{r(r-1)} \sum_{i=1}^{r-1} \sum_{j=i+1}^{r} \mathbb{1}_{[\operatorname{sign}(y'_i - y'_j) \neq \operatorname{sign}(y_i - y_j)]}.
\]

This loss function is more useful than the 0-1 loss as it reflects the level of 
similarity between the two rankings. 

• Normalized Discounted Cumulative Gain (NDCG): This measure emphasizes 
the correctness at the top of the list by using a monotonically nondecreasing 
discount function $D : \mathbb{N} \to \mathbb{R}_+$. We first define a discounted cumulative gain 
measure: 

\[
G(\mathbf{y}',\mathbf{y}) = \sum_{i=1}^{r} D(\pi(\mathbf{y}')_i)\, y_i.
\]

In words, if we interpret $y_i$ as a score of the “true relevance” of item $i$, then 
we take a weighted sum of the relevance of the elements, while the weight of 
$y_i$ is determined on the basis of the position of $i$ in $\pi(\mathbf{y}')$. Assuming that all 
elements of $\mathbf{y}$ are nonnegative, it is easy to verify that $0 \le G(\mathbf{y}',\mathbf{y}) \le G(\mathbf{y},\mathbf{y})$. 
We can therefore define a normalized discounted cumulative gain by the ratio 
$G(\mathbf{y}',\mathbf{y})/G(\mathbf{y},\mathbf{y})$, and the corresponding loss function would be 

\[
\Delta(\mathbf{y}',\mathbf{y}) = 1 - \frac{G(\mathbf{y}',\mathbf{y})}{G(\mathbf{y},\mathbf{y})} = \frac{1}{G(\mathbf{y},\mathbf{y})} \sum_{i=1}^{r} \left( D(\pi(\mathbf{y})_i) - D(\pi(\mathbf{y}')_i) \right) y_i.
\]

We can easily see that $\Delta(\mathbf{y}',\mathbf{y}) \in [0,1]$ and that $\Delta(\mathbf{y}',\mathbf{y}) = 0$ whenever 
$\pi(\mathbf{y}') = \pi(\mathbf{y})$. 
A typical way to define the discount function is by 

\[
D(i) = \begin{cases} \dfrac{1}{\log_2(r - i + 2)} & \text{if } i \in \{r-k+1,\ldots,r\} \\[1ex] 0 & \text{otherwise} \end{cases}
\]

where $k < r$ is a parameter. This means that we care more about elements that 
are ranked higher, and we completely ignore elements that are not among the top-$k$ 
ranked elements. The NDCG measure is often used to evaluate the performance 
of search engines since in such applications it makes sense to completely ignore 
elements that are not at the top of the ranking. 
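
The Kendall-Tau and NDCG losses listed above can be sketched in Python as follows. This is only an illustrative sketch: all function names are hypothetical, ties in the induced permutation are broken by index (an assumption), and the NDCG loss assumes nonnegative relevance scores with $G(\mathbf{y},\mathbf{y}) > 0$.

```python
import math


def sign(z):
    return (z > 0) - (z < 0)


def induced_permutation(y):
    """pi(y): 1-based position of each y[i] in ascending order (ties by index)."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    pi = [0] * len(y)
    for position, i in enumerate(order, start=1):
        pi[i] = position
    return pi


def kendall_tau_loss(y_pred, y):
    """Fraction of pairs (i, j), i < j, that the two vectors order differently."""
    r = len(y)
    disagreements = sum(
        sign(y_pred[i] - y_pred[j]) != sign(y[i] - y[j])
        for i in range(r - 1) for j in range(i + 1, r))
    return 2.0 * disagreements / (r * (r - 1))


def discount(i, r, k):
    """D(i) = 1 / log2(r - i + 2) for the top-k positions r-k+1..r, else 0."""
    return 1.0 / math.log2(r - i + 2) if r - k + 1 <= i <= r else 0.0


def dcg(y_pred, y, k):
    """G(y', y): relevance y_i weighted by the discount of i's position under y'."""
    r = len(y)
    pi = induced_permutation(y_pred)
    return sum(discount(pi[i], r, k) * y[i] for i in range(r))


def ndcg_loss(y_pred, y, k):
    """1 - G(y', y) / G(y, y); assumes y >= 0 and G(y, y) > 0."""
    return 1.0 - dcg(y_pred, y, k) / dcg(y, y, k)


y_true = [3, 1, 2, 0]
print(kendall_tau_loss([0, 2, 1, 3], y_true), ndcg_loss([0, 2, 1, 3], y_true, k=2))
```

A prediction that induces the same ranking as $\mathbf{y}$ gets loss 0 under both measures, while a fully reversed ranking gets Kendall-Tau loss 1.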


Once we have a hypothesis class and a ranking loss function, we can learn a 
ranking function using the ERM rule. However, from the computational point of 
view, the resulting optimization problem might be hard to solve. We next discuss 
how to learn linear predictors for ranking. 


17.4.1 Linear Predictors for Ranking 


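A natural way to define a ranking function is by projecting the instances onto some 
vector $\mathbf{w}$ and then outputting the resulting scalars as our representation of the ranking 
function. That is, assuming that $\mathcal{X} \subset \mathbb{R}^d$, for every $\mathbf{w} \in \mathbb{R}^d$ we define a ranking 
function $h_{\mathbf{w}}((\mathbf{x}_1,\ldots,\mathbf{x}_r)) = (\langle\mathbf{w},\mathbf{x}_1\rangle,\ldots,\langle\mathbf{w},\mathbf{x}_r\rangle)$. 

Let us denote $\Psi(\bar{\mathbf{x}}, \mathbf{v}) = \sum_{i=1}^{r} v_i \mathbf{x}_i$; it follows that 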


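\[
\begin{aligned}
\pi(h_{\mathbf{w}}(\bar{\mathbf{x}})) &= \operatorname*{argmax}_{\mathbf{v}\in V} \sum_{i=1}^{r} v_i \langle \mathbf{w}, \mathbf{x}_i\rangle \\
&= \operatorname*{argmax}_{\mathbf{v}\in V} \Big\langle \mathbf{w}, \sum_{i=1}^{r} v_i \mathbf{x}_i \Big\rangle \\
&= \operatorname*{argmax}_{\mathbf{v}\in V} \langle \mathbf{w}, \Psi(\bar{\mathbf{x}},\mathbf{v})\rangle.
\end{aligned}
\]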


On the basis of this observation, we can use the generalized hinge loss for cost- 
sensitive multiclass classification as a surrogate loss function for the NDCG loss as 
follows: 


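\[
\begin{aligned}
\Delta(h_{\mathbf{w}}(\bar{\mathbf{x}}),\mathbf{y}) &\le \Delta(h_{\mathbf{w}}(\bar{\mathbf{x}}),\mathbf{y}) + \langle\mathbf{w}, \Psi(\bar{\mathbf{x}}, \pi(h_{\mathbf{w}}(\bar{\mathbf{x}})))\rangle - \langle\mathbf{w}, \Psi(\bar{\mathbf{x}}, \pi(\mathbf{y}))\rangle \\
&\le \max_{\mathbf{v}\in V} \left[ \Delta(\mathbf{v},\mathbf{y}) + \langle\mathbf{w}, \Psi(\bar{\mathbf{x}},\mathbf{v})\rangle - \langle\mathbf{w}, \Psi(\bar{\mathbf{x}},\pi(\mathbf{y}))\rangle \right] \\
&= \max_{\mathbf{v}\in V} \left[ \Delta(\mathbf{v},\mathbf{y}) + \sum_{i=1}^{r} (v_i - \pi(\mathbf{y})_i) \langle\mathbf{w},\mathbf{x}_i\rangle \right].
\end{aligned}
\tag{17.8}
\]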


The right-hand side is a convex function with respect to w. 

We can now solve the learning problem using SGD as described in 
Section 17.2.5. The main computational bottleneck is calculating a subgradient of 
the loss function, which is equivalent to finding v that achieves the maximum in 
Equation (17.8) (see Claim 14.6). Using the definition of the NDCG loss, this is 
equivalent to solving the problem 


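\[
\operatorname*{argmin}_{\mathbf{v}\in V} \sum_{i=1}^{r} \left( \alpha_i v_i + \beta_i D(v_i) \right),
\]

where $\alpha_i = -\langle\mathbf{w},\mathbf{x}_i\rangle$ and $\beta_i = y_i/G(\mathbf{y},\mathbf{y})$. We can think of this problem a little bit 
differently by defining a matrix $A \in \mathbb{R}^{r,r}$ where 

\[
A_{i,j} = j\,\alpha_i + D(j)\,\beta_i.
\]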


Now, let us think about each $j$ as a “worker,” each $i$ as a “task,” and $A_{i,j}$ as the cost 
of assigning task $i$ to worker $j$. With this view, the problem of finding $\mathbf{v}$ becomes 
the problem of finding an assignment of the tasks to workers of minimal cost. This 
problem is called “the assignment problem” and can be solved efficiently. One par- 
ticular algorithm is the “Hungarian method” (Kuhn 1955). Another way to solve 
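the assignment problem is using linear programming. To do so, let us first write the 
assignment problem as 

\[
\begin{aligned}
\operatorname*{argmin}_{B \in \mathbb{R}_+^{r,r}} \quad & \sum_{i,j=1}^{r} A_{i,j} B_{i,j} \\
\text{s.t.} \quad & \forall i \in [r],\ \sum_{j=1}^{r} B_{i,j} = 1 \\
& \forall j \in [r],\ \sum_{i=1}^{r} B_{i,j} = 1 \\
& \forall i,j,\ B_{i,j} \in \{0,1\}
\end{aligned}
\tag{17.9}
\]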


A matrix $B$ that satisfies the constraints in the preceding optimization problem is 
called a permutation matrix. This is because the constraints guarantee that exactly 
one entry of each row equals 1 and exactly one entry of each column 
equals 1. Therefore, the matrix $B$ corresponds to the permutation $\mathbf{v} \in V$ defined 
by $v_i = j$ for the single index $j$ that satisfies $B_{i,j} = 1$. 
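
To make the assignment view concrete, the following Python sketch builds the cost matrix $A_{i,j} = j\alpha_i + D(j)\beta_i$ and solves the assignment problem by exhaustive enumeration of permutations. This is an illustration only: the function names and the numeric values of `alpha` and `beta` are made up, the discount is the top-$k$ logarithmic discount from the NDCG definition, and enumeration costs $r!$ evaluations, so a practical solver would use the Hungarian method or a linear program instead.

```python
import math
from itertools import permutations


def discount(j, r, k):
    """Top-k logarithmic discount D(j) used for NDCG."""
    return 1.0 / math.log2(r - j + 2) if r - k + 1 <= j <= r else 0.0


def cost_matrix(alpha, beta, k):
    """A[i][j-1] = j * alpha_i + D(j) * beta_i for positions j = 1..r."""
    r = len(alpha)
    return [[j * alpha[i] + discount(j, r, k) * beta[i] for j in range(1, r + 1)]
            for i in range(r)]


def permutation_matrix(v):
    """B[i][j-1] = 1 exactly when v_i = j."""
    r = len(v)
    return [[1 if v[i] == j else 0 for j in range(1, r + 1)] for i in range(r)]


def inner(A, B):
    """<A, B> = sum_{i,j} A_ij * B_ij."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


def best_permutation(A):
    """Minimize <A, B> over permutation matrices by brute force (r! candidates)."""
    r = len(A)
    return min(permutations(range(1, r + 1)),
               key=lambda v: inner(A, permutation_matrix(v)))


alpha = [-0.2, -1.5, -0.7]   # stands for alpha_i = -<w, x_i> (made-up values)
beta = [0.5, 0.1, 0.4]       # stands for beta_i = y_i / G(y, y) (made-up values)
A = cost_matrix(alpha, beta, k=2)
v = best_permutation(A)
direct = sum(alpha[i] * v[i] + beta[i] * discount(v[i], 3, 2) for i in range(3))
print(v, inner(A, permutation_matrix(v)), direct)
```

The value $\langle A, B\rangle$ of the permutation matrix $B$ built from $\mathbf{v}$ coincides with the original objective $\sum_i (\alpha_i v_i + \beta_i D(v_i))$, which is the point of the reformulation.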

The preceding optimization is still not a linear program because of the combinatorial 
constraint $B_{i,j} \in \{0,1\}$. However, as it turns out, this constraint is redundant: 
if we solve the optimization problem while simply omitting the combinatorial 
constraint, then we are still guaranteed that there is an optimal solution that will 
satisfy this constraint. This is formalized later. 

Denote $\langle A, B\rangle = \sum_{i,j} A_{i,j} B_{i,j}$. Then, Equation (17.9) is the problem of minimizing 
$\langle A, B\rangle$ such that $B$ is a permutation matrix. 

A matrix $B \in \mathbb{R}^{r,r}$ is called doubly stochastic if all elements of $B$ are nonnegative, 
the sum of each row of $B$ is 1, and the sum of each column of $B$ is 1. Therefore, 
solving Equation (17.9) without the constraints $B_{i,j} \in \{0,1\}$ is the problem 

\[
\operatorname*{argmin}_{B \in \mathbb{R}^{r,r}} \langle A, B\rangle \quad \text{s.t. } B \text{ is a doubly stochastic matrix.} \tag{17.10}
\]


The following claim states that every doubly stochastic matrix is a convex 
combination of permutation matrices. 


Claim 17.3 (Birkhoff 1946, Von Neumann 1953). The set of doubly stochastic 
matrices in $\mathbb{R}^{r,r}$ is the convex hull of the set of permutation matrices in $\mathbb{R}^{r,r}$. 


On the basis of the claim, we easily obtain the following: 


Lemma 17.4. There exists an optimal solution of Equation (17.10) that is also an 
optimal solution of Equation (17.9). 


Proof. Let $B$ be a solution of Equation (17.10). Then, by Claim 17.3, we can write 
$B = \sum_i \gamma_i C_i$, where each $C_i$ is a permutation matrix, each $\gamma_i > 0$, and $\sum_i \gamma_i = 1$. 
Since all the $C_i$ are also doubly stochastic, we clearly have that $\langle A, B\rangle \le \langle A, C_i\rangle$ for 
every $i$. We claim that there is some $i$ for which $\langle A, B\rangle = \langle A, C_i\rangle$. This must be true 
since otherwise, if for every $i$ we had $\langle A, B\rangle < \langle A, C_i\rangle$, we would have that 

\[
\langle A, B\rangle = \Big\langle A, \sum_i \gamma_i C_i \Big\rangle = \sum_i \gamma_i \langle A, C_i\rangle > \sum_i \gamma_i \langle A, B\rangle = \langle A, B\rangle,
\]

which cannot hold. We have thus shown that some permutation matrix, $C_i$, satisfies 
$\langle A, B\rangle = \langle A, C_i\rangle$. But, since for every other permutation matrix $C$ we have 
$\langle A, B\rangle \le \langle A, C\rangle$ we conclude that $C_i$ is an optimal solution of both Equation (17.9) 
and Equation (17.10). □ 


17.5 BIPARTITE RANKING AND MULTIVARIATE 
PERFORMANCE MEASURES 


In the previous section we described the problem of ranking. We used a vector 
$\mathbf{y} \in \mathbb{R}^r$ for representing an order over the elements $\mathbf{x}_1,\ldots,\mathbf{x}_r$. If all elements in $\mathbf{y}$ 
are different from each other, then $\mathbf{y}$ specifies a full order over $[r]$. However, if two 
elements of $\mathbf{y}$ attain the same value, $y_i = y_j$ for $i \neq j$, then $\mathbf{y}$ can only specify a partial 
order over $[r]$. In such a case, we say that $\mathbf{x}_i$ and $\mathbf{x}_j$ are of equal relevance according 
to $\mathbf{y}$. In the extreme case, $\mathbf{y} \in \{\pm 1\}^r$, which means that each $\mathbf{x}_i$ is either relevant 
or nonrelevant. This setting is often called “bipartite ranking.” For example, in the 
fraud detection application mentioned in the previous section, each transaction is 
labeled as either fraudulent ($y_i = 1$) or benign ($y_i = -1$). 

Seemingly, we can solve the bipartite ranking problem by learning a binary clas- 
sifier, applying it on each instance, and putting the positive ones at the top of the 
ranked list. However, this may lead to poor results as the goal of a binary learner 
is usually to minimize the zero-one loss (or some surrogate of it), while the goal of 
a ranker might be significantly different. To illustrate this, consider again the prob- 
lem of fraud detection. Usually, most of the transactions are benign (say 99.9%). 
Therefore, a binary classifier that predicts “benign” on all transactions will have a 
zero-one error of 0.1%. While this is a very small number, the resulting predictor is 
meaningless for the fraud detection application. The crux of the problem stems from 
the inadequacy of the zero-one loss for what we are really interested in. A more ade- 
quate performance measure should take into account the predictions over the entire 
set of instances. For example, in the previous section we have defined the NDCG 
loss, which emphasizes the correctness of the top-ranked items. In this section we 
describe additional loss functions that are specifically adequate for bipartite ranking 
problems. 

As in the previous section, we are given a sequence of instances, $\bar{\mathbf{x}} = (\mathbf{x}_1,\ldots,\mathbf{x}_r)$, 
and we predict a ranking vector $\mathbf{y}' \in \mathbb{R}^r$. The feedback vector is $\mathbf{y} \in \{\pm 1\}^r$. We define 
a loss that depends on $\mathbf{y}'$ and $\mathbf{y}$ and depends on a threshold $\theta \in \mathbb{R}$. This threshold 
transforms the vector $\mathbf{y}' \in \mathbb{R}^r$ into the vector $(\operatorname{sign}(y'_1 - \theta),\ldots,\operatorname{sign}(y'_r - \theta)) \in \{\pm 1\}^r$. 
Usually, the value of $\theta$ is set to be 0. However, as we will see, we sometimes set $\theta$ 
while taking into account additional constraints on the problem. 

The loss functions we define in the following depend on the following 4 numbers: 


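\[
\begin{aligned}
\text{True positives: } & a = |\{i : y_i = +1 \wedge \operatorname{sign}(y'_i - \theta) = +1\}| \\
\text{False positives: } & b = |\{i : y_i = -1 \wedge \operatorname{sign}(y'_i - \theta) = +1\}| \\
\text{False negatives: } & c = |\{i : y_i = +1 \wedge \operatorname{sign}(y'_i - \theta) = -1\}| \\
\text{True negatives: } & d = |\{i : y_i = -1 \wedge \operatorname{sign}(y'_i - \theta) = -1\}|
\end{aligned}
\tag{17.11}
\]
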
The recall (a.k.a. sensitivity) of a prediction vector is the fraction of true positives 
$\mathbf{y}'$ “catches,” namely, $\frac{a}{a+c}$. The precision is the fraction of correct predictions among 
the positive labels we predict, namely, $\frac{a}{a+b}$. The specificity is the fraction of true 
negatives that our predictor “catches,” namely, $\frac{d}{d+b}$. 

Note that as we decrease $\theta$ the recall increases (attaining the value 1 when $\theta = -\infty$). 
On the other hand, the precision and the specificity usually decrease as we 
decrease $\theta$. Therefore, there is a tradeoff between precision and recall, and we can 
control it by changing $\theta$. The loss functions defined in the following use various 
techniques for combining both the precision and recall. 


• Averaging sensitivity and specificity: This measure is the average of the sensitivity 
and specificity, namely, $\frac{1}{2}\left(\frac{a}{a+c} + \frac{d}{d+b}\right)$. This is also the accuracy on positive 
examples averaged with the accuracy on negative examples. Here, we set $\theta = 0$ 
and the corresponding loss function is $\Delta(\mathbf{y}',\mathbf{y}) = 1 - \frac{1}{2}\left(\frac{a}{a+c} + \frac{d}{d+b}\right)$. 

• $F_1$-score: The $F_1$ score is the harmonic mean of the precision and recall: 
$\frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}}$. Its maximal value (of 1) is obtained when both precision and recall 
are 1, and its minimal value (of 0) is obtained whenever one of them is 0 (even 
if the other one is 1). The $F_1$ score can be written using the numbers $a, b, c$ 
as follows: $F_1 = \frac{2a}{2a+b+c}$. Again, we set $\theta = 0$, and the loss function becomes 
$\Delta(\mathbf{y}',\mathbf{y}) = 1 - F_1$. 

• $F_\beta$-score: It is like the $F_1$ score, but we attach $\beta^2$ times more importance to 
recall than to precision, that is, $\frac{1+\beta^2}{\frac{1}{\text{Precision}} + \beta^2 \frac{1}{\text{Recall}}}$. It can also be written as 
$F_\beta = \frac{(1+\beta^2)a}{(1+\beta^2)a + b + \beta^2 c}$. Again, we set $\theta = 0$, and the loss function becomes 
$\Delta(\mathbf{y}',\mathbf{y}) = 1 - F_\beta$. 

• Recall at k: We measure the recall while the prediction must contain at most $k$ 
positive labels. That is, we should set $\theta$ so that $a + b \le k$. This is convenient, for 
example, in the application of a fraud detection system, where a bank employee 
can only handle a small number of suspicious transactions. 

• Precision at k: We measure the precision while the prediction must contain at 
least $k$ positive labels. That is, we should set $\theta$ so that $a + b \ge k$. 


The measures defined previously are often referred to as multivariate performance 
measures. Note that these measures are highly different from the average 
zero-one loss, which in the preceding notation equals $\frac{b+c}{a+b+c+d}$. In the aforementioned 
example of fraud detection, when 99.9% of the examples are negatively 
labeled, the zero-one loss of predicting that all the examples are negatives is 0.1%. 
In contrast, the recall of such prediction is 0 and hence the $F_1$ score is also 0, which 
means that the corresponding loss will be 1. 
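
The fraud detection example can be checked numerically with a short Python sketch. The function names are hypothetical, and treating $\operatorname{sign}(0)$ as $-1$ is an assumption made here for definiteness; the counts $a, b, c, d$ follow Equation (17.11).

```python
def confusion_counts(y_pred, y, theta=0.0):
    """Return (a, b, c, d): true pos., false pos., false neg., true neg."""
    a = b = c = d = 0
    for score, label in zip(y_pred, y):
        predicted = 1 if score - theta > 0 else -1  # sign(0) treated as -1 (assumption)
        if label == 1 and predicted == 1:
            a += 1
        elif label == -1 and predicted == 1:
            b += 1
        elif label == 1 and predicted == -1:
            c += 1
        else:
            d += 1
    return a, b, c, d


def f_beta(a, b, c, beta=1.0):
    """F_beta = (1 + beta^2) a / ((1 + beta^2) a + b + beta^2 c); 0 if undefined."""
    denom = (1 + beta ** 2) * a + b + beta ** 2 * c
    return (1 + beta ** 2) * a / denom if denom > 0 else 0.0


def zero_one_loss(a, b, c, d):
    return (b + c) / (a + b + c + d)


# 1000 transactions, one fraudulent; the predictor calls everything benign.
y = [1] + [-1] * 999
y_pred = [-1.0] * 1000
a, b, c, d = confusion_counts(y_pred, y)
print(zero_one_loss(a, b, c, d), 1.0 - f_beta(a, b, c))
```

The zero-one loss is tiny (0.001) while the $F_1$ loss is 1, which is the mismatch the text describes.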


17.5.1 Linear Predictors for Bipartite Ranking 


We next describe how to train linear predictors for bipartite ranking. As in the 
previous section, a linear predictor for ranking is defined to be 


\[
h_{\mathbf{w}}(\bar{\mathbf{x}}) = (\langle\mathbf{w},\mathbf{x}_1\rangle,\ldots,\langle\mathbf{w},\mathbf{x}_r\rangle).
\]


The corresponding loss function is one of the multivariate performance measures 
described before. The loss function depends on $\mathbf{y}' = h_{\mathbf{w}}(\bar{\mathbf{x}})$ via the binary vector it 

of $\Delta$ is fixed so we only need to maximize the expression 

\[
\max_{\mathbf{v}\in V_{a,b}} \sum_{i=1}^{r} v_i \langle\mathbf{w},\mathbf{x}_i\rangle.
\]


Suppose the examples are sorted so that $\langle\mathbf{w},\mathbf{x}_1\rangle \ge \cdots \ge \langle\mathbf{w},\mathbf{x}_r\rangle$. Then, it is easy to 
verify that we would like to set $v_i$ to be positive for the smallest indices $i$. Doing 
this, with the constraint on $a, b$, amounts to setting $v_i = 1$ for the $a$ top-ranked positive 
examples and for the $b$ top-ranked negative examples. This yields the following 
procedure. 


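Solving Equation (17.14) 


input: 
  $(\mathbf{x}_1,\ldots,\mathbf{x}_r)$, $(y_1,\ldots,y_r)$, $\mathbf{w}$, $V$, $\Delta$ 
assumptions: 
  $\Delta$ is a function of $a, b, c, d$ 
  $V$ contains all vectors for which $f(a,b) = 1$ for some function $f$ 
initialize: 
  $P = |\{i : y_i = 1\}|$, $N = |\{i : y_i = -1\}|$ 
  $\boldsymbol{\mu} = (\langle\mathbf{w},\mathbf{x}_1\rangle,\ldots,\langle\mathbf{w},\mathbf{x}_r\rangle)$, $\alpha^\star = -\infty$ 
  sort examples so that $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_r$ 
  let $i_1,\ldots,i_P$ be the (sorted) indices of the positive examples 
  let $j_1,\ldots,j_N$ be the (sorted) indices of the negative examples 
for $a = 0, 1, \ldots, P$ 
  $c = P - a$ 
  for $b = 0, 1, \ldots, N$ such that $f(a,b) = 1$ 
    $d = N - b$ 
    calculate $\Delta$ using $a, b, c, d$ 
    set $v_1,\ldots,v_r$ s.t. $v_{i_1} = \cdots = v_{i_a} = v_{j_1} = \cdots = v_{j_b} = 1$ 
      and the rest of the elements of $\mathbf{v}$ equal $-1$ 
    set $\alpha = \Delta + \sum_{i=1}^{r} v_i \mu_i$ 
    if $\alpha \ge \alpha^\star$ 
      $\alpha^\star = \alpha$, $\mathbf{v}^\star = \mathbf{v}$ 
output $\mathbf{v}^\star$ 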
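
A possible Python rendering of this procedure is sketched below. It is an illustration under assumptions, not the authors' code: the function names are hypothetical, `mu` holds the scores $\langle\mathbf{w},\mathbf{x}_i\rangle$ directly, `delta(a, b, c, d)` is any loss that depends only on the four counts (the $F_1$ loss is used as an example), and `feasible(a, b)` plays the role of $f(a,b) = 1$. The brute-force helper is included only to check the result on tiny inputs.

```python
from itertools import product


def f1_loss(a, b, c, d):
    """1 - F1, with F1 = 2a / (2a + b + c); F1 is taken as 0 when undefined."""
    denom = 2 * a + b + c
    return 1.0 - (2 * a / denom if denom > 0 else 0.0)


def solve_loss_augmented(mu, y, delta, feasible=lambda a, b: True):
    """Maximize delta(a, b, c, d) + sum_i v_i * mu_i over v in {-1, +1}^r."""
    order = sorted(range(len(mu)), key=lambda i: -mu[i])  # decreasing score
    pos = [i for i in order if y[i] == 1]
    neg = [i for i in order if y[i] == -1]
    P, N = len(pos), len(neg)
    best_value, best_v = float("-inf"), None
    for a in range(P + 1):
        c = P - a
        for b in range(N + 1):
            if not feasible(a, b):
                continue
            d = N - b
            # +1 on the a top-ranked positives and the b top-ranked negatives.
            v = [-1] * len(mu)
            for i in pos[:a] + neg[:b]:
                v[i] = 1
            value = delta(a, b, c, d) + sum(vi * mi for vi, mi in zip(v, mu))
            if value >= best_value:
                best_value, best_v = value, v
    return best_v, best_value


def brute_force_value(mu, y, delta, feasible=lambda a, b: True):
    """Exhaustive check over all 2**r sign vectors; for tiny r only."""
    best = float("-inf")
    for v in product([-1, 1], repeat=len(mu)):
        a = sum(1 for vi, yi in zip(v, y) if yi == 1 and vi == 1)
        b = sum(1 for vi, yi in zip(v, y) if yi == -1 and vi == 1)
        c = sum(1 for vi, yi in zip(v, y) if yi == 1 and vi == -1)
        d = len(y) - a - b - c
        if feasible(a, b):
            best = max(best, delta(a, b, c, d) + sum(vi * mi for vi, mi in zip(v, mu)))
    return best


mu = [0.9, -0.3, 0.2, -1.1, 0.05, 0.4]   # made-up scores <w, x_i>
y = [1, -1, -1, 1, 1, -1]
v_star, value = solve_loss_augmented(mu, y, f1_loss)
print(v_star, round(value, 4))
```

As written, the sketch rebuilds $\mathbf{v}$ for every pair $(a, b)$, so it performs on the order of $P \cdot N \cdot r$ operations; maintaining running sums of the sorted scores would avoid the inner recomputation.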


17.6 SUMMARY 


Many real world supervised learning problems can be cast as learning a multiclass 
predictor. We started the chapter by introducing reductions of multiclass learning 
to binary learning. We then described and analyzed the family of linear predictors 
for multiclass learning. We have shown how this family can be used even if the 
number of classes is extremely large, as long as we have an adequate structure on 
the problem. Finally, we have described ranking problems. In Chapter 29 we study 
the sample complexity of multiclass learning in more detail. 


17.7 BIBLIOGRAPHIC REMARKS 


The One-versus-All and All-Pairs approach reductions have been unified under the 
framework of Error Correction Output Codes (ECOC) (Dietterich & Bakiri 1995, 
Allwein, Schapire & Singer 2000). There are also other types of reductions such 
as tree-based classifiers (see, for example, Beygelzimer, Langford & Ravikumar 
(2007)). The limitations of reduction techniques have been studied in (Daniely et 
al. 2011, Daniely et al. 2012). See also Chapter 29, in which we analyze the sample 
complexity of multiclass learning. 

Direct approaches to multiclass learning with linear predictors have been studied 
in (Vapnik 1998, Weston & Watkins 1999, Crammer & Singer 2001). In particular, 
the multivector construction is due to Crammer and Singer (2001). 

Collins (2000) has shown how to apply the Perceptron algorithm for structured 
output problems. See also Collins (2002). A related approach is discriminative 
learning of conditional random fields; see Lafferty et al. (2001). Structured out- 
put SVM has been studied in (Collins 2002, Taskar et al. 2003, Tsochantaridis et al. 
2004). 

The dynamic procedure we have presented for calculating the prediction $h_{\mathbf{w}}(\mathbf{x})$ 
in the structured output section is similar to the forward-backward variables 
calculated by the Viterbi procedure in HMMs (see, for instance, (Rabiner & 
Juang 1986)). More generally, solving the maximization problem in structured out- 
put is closely related to the problem of inference in graphical models (see, for 
example, Koller & Friedman (2009a)). 

Chapelle, Le, and Smola (2007) proposed to learn a ranking function with 
respect to the NDCG loss using ideas from structured output learning. They also 
observed that the maximization problem in the definition of the generalized hinge 
loss is equivalent to the assignment problem. 

Agarwal and Roth (2005) analyzed the sample complexity of bipartite rank- 
ing. Joachims (2005) studied the applicability of structured output SVM to bipartite 
ranking with multivariate performance measures. 


17.8 EXERCISES 


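17.1 Consider a set $S$ of examples in $\mathbb{R}^n \times [k]$ for which there exist vectors $\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k$ 
such that every example $(\mathbf{x}, y) \in S$ falls within a ball centered at $\boldsymbol{\mu}_y$ whose radius 
is $r \ge 1$. Assume also that for every $i \neq j$, $\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\| \ge 4r$. Consider concatenating 
each instance by the constant 1 and then applying the multivector construction, 
namely, 

\[
\Psi(\mathbf{x},y) = [\,\underbrace{0,\ldots,0}_{\in\,\mathbb{R}^{(y-1)(n+1)}},\ \underbrace{x_1,\ldots,x_n,1}_{\in\,\mathbb{R}^{n+1}},\ \underbrace{0,\ldots,0}_{\in\,\mathbb{R}^{(k-y)(n+1)}}\,].
\]

Show that there exists a vector $\mathbf{w} \in \mathbb{R}^{k(n+1)}$ such that $\ell(\mathbf{w},(\mathbf{x},y)) = 0$ for every 
$(\mathbf{x},y) \in S$. 
Hint: Observe that for every example $(\mathbf{x},y) \in S$ we can write $\mathbf{x} = \boldsymbol{\mu}_y + \mathbf{v}$ for some 
$\|\mathbf{v}\| \le r$. Now, take $\mathbf{w} = [\mathbf{w}_1,\ldots,\mathbf{w}_k]$, where $\mathbf{w}_i = [\boldsymbol{\mu}_i, -\|\boldsymbol{\mu}_i\|^2/2]$. 

17.2 Multiclass Perceptron: Consider the following algorithm: 


Multiclass Batch Perceptron 


Input: 
  A training set $(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)$ 
  A class-sensitive feature mapping $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ 
Initialize: $\mathbf{w}^{(1)} = (0,\ldots,0) \in \mathbb{R}^d$ 
For $t = 1, 2, \ldots$ 
  If ($\exists\, i$ and $y \neq y_i$ s.t. $\langle\mathbf{w}^{(t)}, \Psi(\mathbf{x}_i, y_i)\rangle \le \langle\mathbf{w}^{(t)}, \Psi(\mathbf{x}_i, y)\rangle$) then 
    $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \Psi(\mathbf{x}_i, y_i) - \Psi(\mathbf{x}_i, y)$ 
  else 
    output $\mathbf{w}^{(t)}$ 


Prove the following: 


Theorem 17.5. Assume that there exists $\mathbf{w}^\star$ such that for all $i$ and for all $y \neq y_i$ it 
holds that $\langle\mathbf{w}^\star, \Psi(\mathbf{x}_i, y_i)\rangle \ge \langle\mathbf{w}^\star, \Psi(\mathbf{x}_i, y)\rangle + 1$. Let $R = \max_{i,y} \|\Psi(\mathbf{x}_i, y_i) - \Psi(\mathbf{x}_i, y)\|$. 
Then, the multiclass Perceptron algorithm stops after at most $(R\|\mathbf{w}^\star\|)^2$ iterations, 
and when it stops it holds that $\forall i \in [m]$, $y_i = \operatorname{argmax}_y \langle\mathbf{w}^{(t)}, \Psi(\mathbf{x}_i, y)\rangle$. 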
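
For readers who want to experiment before attempting the proof, here is a minimal Python sketch of the Multiclass Batch Perceptron combined with the multivector construction. Several details are assumptions made for illustration: classes are numbered 0..k-1, the function names are hypothetical, the toy data set is made up, and the `max_updates` guard is not part of the algorithm in the exercise (it only prevents an endless loop on non-separable data).

```python
def multivector_psi(x, y, k):
    """Multivector construction: copy x into the y-th of k blocks (y in 0..k-1)."""
    n = len(x)
    out = [0.0] * (k * n)
    out[y * n:(y + 1) * n] = x
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def find_violation(w, examples, psi, labels):
    """Return (x, y_i, y) with <w, psi(x, y_i)> <= <w, psi(x, y)>, or None."""
    for x, y_true in examples:
        for y_other in labels:
            if y_other != y_true and dot(w, psi(x, y_true)) <= dot(w, psi(x, y_other)):
                return x, y_true, y_other
    return None


def multiclass_batch_perceptron(examples, psi, labels, dim, max_updates=10000):
    """Return (w, number of updates); max_updates is a safety guard (assumption)."""
    w = [0.0] * dim
    for updates in range(max_updates):
        violation = find_violation(w, examples, psi, labels)
        if violation is None:
            return w, updates
        x, y_true, y_other = violation
        w = [wi + p - q for wi, p, q in zip(w, psi(x, y_true), psi(x, y_other))]
    raise RuntimeError("no separating vector found within max_updates updates")


# Toy separable data in R^2 with three classes (made up); a constant 1 is appended.
k = 3
labels = range(k)
raw = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([-1.0, -1.0], 2),
       ([2.0, 0.1], 0), ([0.1, 2.0], 1), ([-2.0, -1.5], 2)]
examples = [(x + [1.0], y) for x, y in raw]


def psi(x, y):
    return multivector_psi(x, y, k)


w, updates = multiclass_batch_perceptron(examples, psi, labels, dim=3 * k)
print(updates, w)
```

On this data the vector $\mathbf{w}^\star$ formed from the blocks $(1,0,0)$, $(0,1,0)$, $(-1,-1,0)$ satisfies the margin assumption of Theorem 17.5, so the number of updates is bounded by $(R\|\mathbf{w}^\star\|)^2 = 58$.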


17.3 Generalize the dynamic programming procedure given in Section 17.3 for solving 
the maximization problem given in the definition of $\hat{\mathbf{y}}$ in the SGD procedure 
for multiclass prediction. You can assume that $\Delta(\mathbf{y}',\mathbf{y}) = \sum_{t=1}^{r} \delta(y'_t, y_t)$ for some 
arbitrary function $\delta$. 

17.4 Prove that Equation (17.7) holds. 

17.5 Show that the two definitions of a as defined in Equation (17.12) and 
Equation (17.13) are indeed equivalent for all the multivariate performance 
measures. 


A decision tree is a predictor, $h : \mathcal{X} \to \mathcal{Y}$, that predicts the label associated with 
an instance $\mathbf{x}$ by traveling from a root node of a tree to a leaf. For simplicity we 
focus on the binary classification setting, namely, $\mathcal{Y} = \{0,1\}$, but decision trees can 
be applied for other prediction problems as well. At each node on the root-to-leaf 
path, the successor child is chosen on the basis of a splitting of the input space. 
Usually, the splitting is based on one of the features of x or on a predefined set of 
splitting rules. A leaf contains a specific label. An example of a decision tree for the 
papayas example (described in Chapter 2) is given in the following: 


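[Figure: decision tree for the papayas example. The root node tests the color: if it is pale green to pale yellow, a second node tests the softness, otherwise the leaf is not-tasty. If the papaya gives slightly to palm pressure, the leaf is tasty; otherwise it is not-tasty.] 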


To check if a given papaya is tasty or not, the decision tree first examines the 
color of the papaya. If this color is not in the range pale green to pale yellow, then 
the tree immediately predicts that the papaya is not tasty without additional tests. 
Otherwise, the tree turns to examine the softness of the papaya. If the softness level 
of the papaya is such that it gives slightly to palm pressure, the decision tree predicts 
that the papaya is tasty. Otherwise, the prediction is “not-tasty.” The preceding 
example underscores one of the main advantages of decision trees — the resulting 
classifier is very simple to understand and interpret. 

Assuming each internal node has two children,¹ it is not hard to show that this 
is a prefix-free encoding of the tree, and that the description length of a tree with $n$ 
nodes is $(n+1)\log_2(d+3)$. 

By Theorem 7.7 we have that with probability of at least $1-\delta$ over a sample of 
size $m$, for every $n$ and every decision tree $h \in \mathcal{H}$ with $n$ nodes it holds that 

\[
L_{\mathcal{D}}(h) \le L_S(h) + \sqrt{\frac{(n+1)\log_2(d+3) + \log(2/\delta)}{2m}}. \tag{18.1}
\]

This bound performs a tradeoff: on the one hand, we expect larger, more complex 
decision trees to have a smaller training risk, $L_S(h)$, but the respective value of $n$ 
will be larger. On the other hand, smaller decision trees will have a smaller value of 
$n$, but $L_S(h)$ might be larger. Our hope (or prior knowledge) is that we can find a 
decision tree with both low empirical risk, $L_S(h)$, and a number of nodes $n$ not too 
high. Our bound indicates that such a tree will have low true risk, $L_{\mathcal{D}}(h)$. 


18.2 DECISION TREE ALGORITHMS 


The bound on $L_{\mathcal{D}}(h)$ given in Equation (18.1) suggests a learning rule for decision 
trees: search for a tree that minimizes the right-hand side of Equation (18.1). 
Unfortunately, it turns out that solving this problem is computationally hard.² Consequently, 
practical decision tree learning algorithms are based on heuristics such 
as a greedy approach, where the tree is constructed gradually, and locally optimal 
decisions are made at the construction of each node. Such algorithms cannot guar- 
antee to return the globally optimal decision tree but tend to work reasonably well 
in practice. 

A general framework for growing a decision tree is as follows. We start with a 
tree with a single leaf (the root) and assign this leaf a label according to a majority 
vote among all labels over the training set. We now perform a series of iterations. 
On each iteration, we examine the effect of splitting a single leaf. We define some 
“gain” measure that quantifies the improvement due to this split. Then, among all 
possible splits, we either choose the one that maximizes the gain and perform it, or 
choose not to split the leaf at all. 

In the following we provide a possible implementation. It is based on a popu- 
lar decision tree algorithm known as “ID3” (short for “Iterative Dichotomizer 3”). 
We describe the algorithm for the case of binary features, namely, $\mathcal{X} = \{0,1\}^d$, and 
therefore all splitting rules are of the form $\mathbb{1}_{[x_i = 1]}$ for some feature $i \in [d]$. We discuss 
the case of real valued features in Section 18.2.3. 

The algorithm works by recursive calls, with the initial call being ID3(S, [d]), and 
returns a decision tree. In the pseudocode that follows, we use a call to a procedure 
Gain(S,i), which receives a training set S and an index i and evaluates the gain of a 
split of the tree according to the ith feature. We describe several gain measures in 
Section 18.2.1. 


¹ We may assume this without loss of generality, because if a decision node has only one child, we can 
replace the node by its child without affecting the predictions of the decision tree. 
² More precisely, if NP ≠ P then no algorithm can minimize the right-hand side of Equation (18.1) in time polynomial in $n$, $d$, and $m$. 

ID3(S, A) 


INPUT: training set $S$, feature subset $A \subseteq [d]$ 
if all examples in $S$ are labeled by 1, return a leaf 1 
if all examples in $S$ are labeled by 0, return a leaf 0 
if $A = \emptyset$, return a leaf whose value = majority of labels in $S$ 
else: 
  Let $j = \operatorname{argmax}_{i \in A} \mathrm{Gain}(S, i)$ 
  if all examples in $S$ have the same label 
    Return a leaf whose value = majority of labels in $S$ 
  else 
    Let $T_1$ be the tree returned by ID3($\{(\mathbf{x}, y) \in S : x_j = 1\}$, $A \setminus \{j\}$). 
    Let $T_2$ be the tree returned by ID3($\{(\mathbf{x}, y) \in S : x_j = 0\}$, $A \setminus \{j\}$). 
    Return the tree whose root node tests whether $x_j = 1$, with $T_1$ as the subtree 
      for $x_j = 1$ and $T_2$ as the subtree for $x_j = 0$. 


18.2.1 Implementations of the Gain Measure 


Different algorithms use different implementations of Gain(S, i). Here we present 
three. We use the notation $\mathbb{P}_S[F]$ to denote the probability that an event $F$ holds with 
respect to the uniform distribution over $S$. 

Train Error: The simplest definition of gain is the decrease in training error. 
Formally, let $C(a) = \min\{a, 1-a\}$. Note that the training error before splitting on 
feature $i$ is $C(\mathbb{P}_S[y=1])$, since we took a majority vote among labels. Similarly, the 
error after splitting on feature $i$ is 

\[
\mathbb{P}_S[x_i = 1]\, C(\mathbb{P}_S[y=1 \mid x_i=1]) + \mathbb{P}_S[x_i = 0]\, C(\mathbb{P}_S[y=1 \mid x_i=0]).
\]

Therefore, we can define Gain to be the difference between the two, namely, 

\[
\mathrm{Gain}(S,i) := C(\mathbb{P}_S[y=1]) - \Big( \mathbb{P}_S[x_i=1]\, C(\mathbb{P}_S[y=1\mid x_i=1]) + \mathbb{P}_S[x_i=0]\, C(\mathbb{P}_S[y=1\mid x_i=0]) \Big).
\]


Information Gain: Another popular gain measure that is used in the ID3 and 
C4.5 algorithms of Quinlan (1993) is the information gain. The information gain 
is the difference between the entropy of the label before and after the split, and 
is achieved by replacing the function C in the previous expression by the entropy 
function, 


\[
C(a) = -a\log(a) - (1-a)\log(1-a).
\]

Gini Index: Yet another definition of a gain, which is used by the CART 
algorithm of Breiman, Friedman, Olshen, and Stone (1984), is the Gini index, 

\[
C(a) = 2a(1-a).
\]


Both the information gain and the Gini index are smooth and concave upper bounds 
of the train error. These properties can be advantageous in some situations (see, 
for example, Kearns & Mansour (1996)). 
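
The three gain measures and the ID3 recursion can be sketched in Python as follows. This is a minimal illustration rather than Quinlan's implementation: the function names are hypothetical, a tree is represented as either a label or a tuple `(j, subtree_for_0, subtree_for_1)`, ties in the majority vote go to label 1, and an empty branch falls back to the majority label of the parent sample (the last two choices are assumptions not fixed by the text).

```python
import math


def train_error_C(a):
    return min(a, 1 - a)


def entropy_C(a):
    return 0.0 if a in (0.0, 1.0) else -a * math.log(a) - (1 - a) * math.log(1 - a)


def gini_C(a):
    return 2 * a * (1 - a)


def frac_positive(S):
    """P_S[y = 1] for a list S of (x, y) pairs; 0 for an empty list."""
    return sum(y for _, y in S) / len(S) if S else 0.0


def gain(S, i, C):
    """Gain(S, i) for a binary feature i and an impurity function C."""
    S1 = [(x, y) for x, y in S if x[i] == 1]
    S0 = [(x, y) for x, y in S if x[i] == 0]
    after = (len(S1) / len(S)) * C(frac_positive(S1)) \
        + (len(S0) / len(S)) * C(frac_positive(S0))
    return C(frac_positive(S)) - after


def majority(S):
    return 1 if 2 * sum(y for _, y in S) >= len(S) else 0  # tie -> 1 (assumption)


def id3(S, A, C=entropy_C):
    """Grow a tree on a nonempty sample S using the feature index set A."""
    labels = {y for _, y in S}
    if len(labels) == 1:
        return labels.pop()
    if not A:
        return majority(S)
    j = max(sorted(A), key=lambda i: gain(S, i, C))
    children = []
    for value in (0, 1):
        subset = [(x, y) for x, y in S if x[j] == value]
        # Assumption: an empty branch becomes a leaf with the majority label of S.
        children.append(id3(subset, A - {j}, C) if subset else majority(S))
    return (j, children[0], children[1])


def predict(tree, x):
    while isinstance(tree, tuple):
        j, subtree0, subtree1 = tree
        tree = subtree1 if x[j] == 1 else subtree0
    return tree


# Toy data: y = x_0 AND x_1, with x_2 irrelevant.
S = [((a, b, c), a & b) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
tree = id3(S, {0, 1, 2})
print(tree)  # (0, 0, (1, 0, 1))
```

On this toy sample the train-error gain of every single feature is 0, because no single split changes the majority-vote error, whereas the information gain of $x_0$ is strictly positive. This illustrates one situation in which a smooth, concave measure is advantageous.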


18.2.2 Pruning 


The ID3 algorithm described previously still suffers from a big problem: The 
returned tree will usually be very large. Such trees may have low empirical risk, 
but their true risk will tend to be high — both according to our theoretical analysis, 
and in practice. One solution is to limit the number of iterations of ID3, leading 
to a tree with a bounded number of nodes. Another common solution is to prune 
the tree after it is built, hoping to reduce it to a much smaller tree, but still with a 
similar empirical error. Theoretically, according to the bound in Equation (18.1), if 
we can make $n$ much smaller without increasing $L_S(h)$ by much, we are likely to get 
a decision tree with a smaller true risk. 

Usually, the pruning is performed by a bottom-up walk on the tree. Each node 
might be replaced with one of its subtrees or with a leaf, based on some bound or 
estimate of $L_{\mathcal{D}}(h)$ (for example, the bound in Equation (18.1)). A pseudocode of a 
common template is given in the following. 


Generic Tree Pruning Procedure 


input: 
function f(T,m) (bound/estimate for the generalization error 
of a decision tree T, based on a sample of size m), 
tree T. 
foreach node j in a bottom-up walk on T (from leaves to root): 


find T’ which minimizes f(T’,m), where T’ is any of the following: 
the current tree after replacing node j with a leaf 1. 
the current tree after replacing node j with a leaf 0. 
the current tree after replacing node j with its left subtree. 
the current tree after replacing node j with its right subtree. 
the current tree. 
let T:=T'. 


18.2.3 Threshold-Based Splitting Rules for Real-Valued Features 


In the previous section we have described an algorithm for growing a decision tree
assuming that the features are binary and the splitting rules are of the form $\mathbb{1}_{[x_i = 1]}$.
We now extend this result to the case of real-valued features and threshold-based
splitting rules, namely, $\mathbb{1}_{[x_i < \theta]}$. Such splitting rules yield decision stumps, and we
have studied them in Chapter 10.

The basic idea is to reduce the problem to the case of binary features as follows.
Let $x_1, \ldots, x_m$ be the instances of the training set. For each real-valued feature $i$,
sort the instances so that $x_{1,i} \le \cdots \le x_{m,i}$. Define a set of thresholds
$\theta_{0,i}, \ldots, \theta_{m+1,i}$ such that $\theta_{j,i} \in (x_{j,i}, x_{j+1,i})$ (where we use the convention
$x_{0,i} = -\infty$ and $x_{m+1,i} = \infty$). Finally, for each $i$ and $j$ we define the binary feature
$\mathbb{1}_{[x_i < \theta_{j,i}]}$. Once we have constructed these binary features, we can run the ID3
procedure described in the previous section. It is easy to verify that for any decision
tree with threshold-based splitting rules over the original real-valued features there
exists a decision tree over the constructed binary features with the same training
error and the same number of nodes.

If the original number of real-valued features is $d$ and the number of examples
is $m$, then the number of constructed binary features becomes $dm$. Calculating the
Gain of each feature might therefore take $O(dm^2)$ operations. However, using a
more clever implementation, the runtime can be reduced to $O(dm \log(m))$. The
idea is similar to the implementation of ERM for decision stumps as described in
Section 10.1.1.
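
The reduction to binary features can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, duplicate feature values are merged, interior thresholds are taken to be midpoints between consecutive sorted values, and the two extreme thresholds are placed one unit outside the observed range. Each of these is an assumption of the example; the text only requires each threshold to lie between consecutive values.

```python
def thresholds(values):
    """Candidate thresholds for one real-valued feature."""
    xs = sorted(set(values))                          # distinct values, ascending
    mids = [(a + b) / 2 for a, b in zip(xs, xs[1:])]  # one threshold per gap
    return [xs[0] - 1.0] + mids + [xs[-1] + 1.0]      # plus one below and one above


def binarize(X):
    """Map rows of real-valued features to the binary features 1[x_i < theta_{j,i}]."""
    d = len(X[0])
    thetas = [thresholds([row[i] for row in X]) for i in range(d)]
    B = [[int(row[i] < t) for i in range(d) for t in thetas[i]] for row in X]
    return B, thetas


# Usage: three instances with two real-valued features each.
X = [[0.5, 3.0], [1.5, 1.0], [2.5, 2.0]]
B, thetas = binarize(X)
```

The resulting rows of `B` can be handed to any learner for binary features, such as ID3. This naive construction recomputes every binary feature explicitly and does not implement the faster sorting-based Gain computation mentioned above.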


18.3 RANDOM FORESTS 


As mentioned before, the class of decision trees of arbitrary size has infinite VC
dimension. We therefore restricted the size of the decision tree. Another way to
reduce the danger of overfitting is by constructing an ensemble of trees. In
particular, in the following we describe the method of random forests, introduced by
Breiman (2001).

A random forest is a classifier consisting of a collection of decision trees, where
each tree is constructed by applying an algorithm $A$ on the training set $S$ and an
additional random vector, $\theta$, where $\theta$ is sampled i.i.d. from some distribution. The
prediction of the random forest is obtained by a majority vote over the predictions
of the individual trees.

To specify a particular random forest, we need to define the algorithm $A$ and
the distribution over $\theta$. There are many ways to do this and here we describe one
particular option. We generate $\theta$ as follows. First, we take a random subsample
from $S$ with replacement; namely, we sample a new training set $S'$ of size $m'$ using
the uniform distribution over $S$. Second, we construct a sequence $I_1, I_2, \ldots$, where
each $I_t$ is a subset of $[d]$ of size $k$, which is generated by sampling uniformly at
random elements from $[d]$. All these random variables form the vector $\theta$. Then,
the algorithm $A$ grows a decision tree (e.g., using the ID3 algorithm) based on the
sample $S'$, where at each splitting stage of the algorithm, the algorithm is restricted
to choosing a feature that maximizes Gain from the set $I_t$. Intuitively, if $k$ is small,
this restriction may prevent overfitting.
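
The option described above can be sketched for binary features and binary labels as follows. This is a minimal sketch, not Breiman's implementation: the function names, the tuple encoding of trees, the `max_depth` cutoff, and the choice to stop when the selected feature does not split the sample are assumptions made for the example.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Binary entropy of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)


def gain(S, i):
    """Information gain of splitting the labeled sample S on binary feature i."""
    parts = [[y for x, y in S if x[i] == v] for v in (0, 1)]
    after = sum(len(p) / len(S) * entropy(p) for p in parts)
    return entropy([y for _, y in S]) - after


def grow_tree(S, d, k, rng, depth=0, max_depth=8):
    """ID3-style growth where each split may only use a random subset I_t of [d]."""
    labels = [y for _, y in S]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return majority                       # leaf: an int label
    I_t = rng.sample(range(d), k)             # k features drawn uniformly from [d]
    i = max(I_t, key=lambda j: gain(S, j))    # best feature within I_t
    S0 = [(x, y) for x, y in S if x[i] == 0]
    S1 = [(x, y) for x, y in S if x[i] == 1]
    if not S0 or not S1:
        return majority                       # chosen feature does not split S
    return (i,
            grow_tree(S0, d, k, rng, depth + 1, max_depth),
            grow_tree(S1, d, k, rng, depth + 1, max_depth))


def predict_tree(tree, x):
    while isinstance(tree, tuple):            # internal node: (feature, subtree0, subtree1)
        i, t0, t1 = tree
        tree = t1 if x[i] == 1 else t0
    return tree


def random_forest(S, d, k, n_trees, m_prime, seed=0):
    """Return the majority-vote predictor over n_trees randomized trees."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        S_prime = [rng.choice(S) for _ in range(m_prime)]  # subsample with replacement
        trees.append(grow_tree(S_prime, d, k, rng))

    def h(x):
        votes = Counter(predict_tree(t, x) for t in trees)
        return votes.most_common(1)[0][0]
    return h


# Usage: the label equals the first of three binary features.
points = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
S = [(x, x[0]) for x in points]
forest = random_forest(S, d=3, k=3, n_trees=11, m_prime=50, seed=1)
```

With `k` equal to `d` the feature restriction is vacuous and only the resampling randomizes the trees; smaller values of `k` give the restriction discussed in the text.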


18.4 SUMMARY 


Decision trees are very intuitive predictors. Typically, if a human programmer
creates a predictor it will look like a decision tree. We have shown that the VC
dimension of decision trees with $k$ leaves is $k$ and proposed the MDL paradigm for
learning decision trees. The main problem with decision trees is that they are
computationally hard to learn; therefore we described several heuristic procedures for
training them.


18.5 BIBLIOGRAPHIC REMARKS 


Many algorithms for learning decision trees (such as ID3 and C4.5) have been 
derived by Quinlan (1986). The CART algorithm is due to Breiman, Friedman, 
Olshen & Stone (1984). Random forests were introduced by Breiman (2001). For 
additional reading we refer the reader to (Hastie, Tibshirani & Friedman 2001, 
Rokach 2007). 

The proof of the hardness of training decision trees is given in Hyafil and Rivest 
(1976). 


18.6 EXERCISES 


18.1 1. Show that any binary classifier $h : \{0,1\}^d \to \{0,1\}$ can be implemented as a
decision tree of height at most $d + 1$, with internal nodes of the form $(x_i = 0?)$ for
some $i \in \{1, \ldots, d\}$.
2. Conclude that the VC dimension of the class of decision trees over the domain
$\{0,1\}^d$ is $2^d$.


18.2 (Suboptimality of ID3)
Consider the following training set, where $\mathcal{X} = \{0,1\}^3$ and $\mathcal{Y} = \{0,1\}$:

((1, 1, 1), 1)
((1, 0, 0), 1)
((1, 1, 0), 0)
((0, 0, 1), 0)

Suppose we wish to use this training set in order to build a decision tree of depth
2 (i.e., for each input we are allowed to ask two questions of the form $(x_i = 0?)$
before deciding on the label).

1. Suppose we run the ID3 algorithm up to depth 2 (namely, we pick the root 
node and its children according to the algorithm, but instead of keeping on 
with the recursion, we stop and pick leaves according to the majority label in 
each subtree). Assume that the subroutine used to measure the quality of each 
feature is based on the entropy function (so we measure the information gain), 
and that if two features get the same score, one of them is picked arbitrarily. 
Show that the training error of the resulting decision tree is at least 1/4. 

2. Find a decision tree of depth 2 that attains zero training error. 




Nearest Neighbor 


Nearest Neighbor algorithms are among the simplest of all machine learning algo- 
rithms. The idea is to memorize the training set and then to predict the label of 
any new instance on the basis of the labels of its closest neighbors in the training 
set. The rationale behind such a method is based on the assumption that the fea- 
tures that are used to describe the domain points are relevant to their labelings in a 
way that makes close-by points likely to have the same label. Furthermore, in some 
situations, even when the training set is immense, finding a nearest neighbor can 
be done extremely fast (for example, when the training set is the entire Web and 
distances are based on links). 

Note that, in contrast with the algorithmic paradigms that we have discussed 
so far, like ERM, SRM, MDL, or RLM, that are determined by some hypothesis 
class, H, the Nearest Neighbor method figures out a label on any test point without 
searching for a predictor within some predefined class of functions. 

In this chapter we describe Nearest Neighbor methods for classification and 
regression problems. We analyze their performance for the simple case of binary 
classification and discuss the efficiency of implementing these methods. 


19.1 k NEAREST NEIGHBORS 


Throughout the entire chapter we assume that our instance domain, $\mathcal{X}$, is endowed
with a metric function $\rho$. That is, $\rho : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a function that returns the distance
between any two elements of $\mathcal{X}$. For example, if $\mathcal{X} = \mathbb{R}^d$ then $\rho$ can be the Euclidean
distance, $\rho(x, x') = \|x - x'\| = \sqrt{\sum_{i=1}^{d} (x_i - x'_i)^2}$.

Let $S = (x_1, y_1), \ldots, (x_m, y_m)$ be a sequence of training examples. For each $x \in \mathcal{X}$,
let $\pi_1(x), \ldots, \pi_m(x)$ be a reordering of $\{1, \ldots, m\}$ according to their distance to $x$,
$\rho(x, x_i)$. That is, for all $i < m$,

$$\rho(x, x_{\pi_i(x)}) \le \rho(x, x_{\pi_{i+1}(x)}).$$




Figure 19.1. An illustration of the decision boundaries of the 1-NN rule. The points 
depicted are the sample points, and the predicted label of any new point will be the 
label of the sample point in the center of the cell it belongs to. These cells are called a 
Voronoi Tessellation of the space. 

For a number k, the k-NN rule for binary classification is defined as follows: 


k-NN

input: a training sample $S = (x_1, y_1), \ldots, (x_m, y_m)$
output: for every point $x \in \mathcal{X}$,
  return the majority label among $\{y_{\pi_i(x)} : i \le k\}$


When $k = 1$, we have the 1-NN rule:

$$h_S(x) = y_{\pi_1(x)}.$$


A geometric illustration of the 1-NN rule is given in Figure 19.1. 
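
The k-NN rule can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions: the function names are hypothetical, the metric defaults to the Euclidean distance, and ties in distance or in the vote are broken by the order of the training examples, which the definition leaves open. The `phi` argument aggregates the labels of the $k$ nearest neighbors, so passing `average` instead of `majority` gives a predictor for real-valued targets.

```python
import math
from collections import Counter


def euclidean(x, z):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))


def k_nearest(S, x, k, rho=euclidean):
    """The k training pairs (x_i, y_i) closest to x, i.e. indices pi_1(x), ..., pi_k(x)."""
    return sorted(S, key=lambda pair: rho(x, pair[0]))[:k]


def majority(neighbors):
    return Counter(y for _, y in neighbors).most_common(1)[0][0]


def average(neighbors):
    return sum(y for _, y in neighbors) / len(neighbors)


def knn(S, x, k, phi=majority, rho=euclidean):
    """Predict the label of x by aggregating its k nearest neighbors with phi."""
    return phi(k_nearest(S, x, k, rho))


# Usage: binary classification, then averaging over real-valued targets.
S = [((0.0, 0.0), 0), ((0.1, 0.0), 0), ((1.0, 1.0), 1), ((0.9, 1.0), 1), ((1.0, 0.9), 1)]
label_1nn = knn(S, (0.05, 0.1), k=1)
label_3nn = knn(S, (0.8, 0.8), k=3)
S_reg = [((0.0,), 1.0), ((1.0,), 3.0), ((10.0,), 100.0)]
value = knn(S_reg, (0.4,), k=2, phi=average)
```

Each prediction sorts the whole training set, so this sketch costs more than the single scan needed to find the nearest neighbors; it is meant to mirror the definition rather than to be efficient.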

For regression problems, namely, $\mathcal{Y} = \mathbb{R}$, one can define the prediction to be
the average target of the $k$ nearest neighbors. That is, $h_S(x) = \frac{1}{k} \sum_{i=1}^{k} y_{\pi_i(x)}$. More
generally, for some function $\phi : (\mathcal{X} \times \mathcal{Y})^k \to \mathcal{Y}$, the k-NN rule with respect to $\phi$ is:

$$h_S(x) = \phi\big((x_{\pi_1(x)}, y_{\pi_1(x)}), \ldots, (x_{\pi_k(x)}, y_{\pi_k(x)})\big). \qquad (19.1)$$

It is easy to verify that we can cast the prediction by majority of labels (for
classification) or by the averaged target (for regression) as in Equation (19.1) by an
appropriate choice of $\phi$. The generality can lead to other rules; for example, if $\mathcal{Y} = \mathbb{R}$,
we can take a weighted average of the targets according to the distance from $x$:

$$h_S(x) = \sum_{i=1}^{k} \frac{\rho(x, x_{\pi_i(x)})}{\sum_{j=1}^{k} \rho(x, x_{\pi_j(x)})} \, y_{\pi_i(x)}.$$

19.2 ANALYSIS 


Since the NN rules are such natural learning methods, their generalization
properties have been extensively studied. Most previous results are asymptotic consistency
results, analyzing the performance of NN rules when the sample size, $m$, goes to
infinity, and the rate of convergence depends on the underlying distribution. As we
have argued in Section 7.4, this type of analysis is not satisfactory. One would like to
learn from finite training samples and to understand the generalization performance
as a function of the size of such finite training sets and clear prior assumptions on
the data distribution. We therefore provide a finite-sample analysis of the 1-NN rule,
showing how the error decreases as a function of $m$ and how it depends on
properties of the distribution. We will also explain how the analysis can be generalized to
k-NN rules for arbitrary values of $k$. In particular, the analysis specifies the number
of examples required to achieve a true error of $2 L_{\mathcal{D}}(h^\star) + \epsilon$, where $h^\star$ is the Bayes
optimal hypothesis, assuming that the labeling rule is “well behaved” (in a sense we
will define later).


19.2.1 A Generalization Bound for the 1-NN Rule 


We now analyze the true error of the 1-NN rule for binary classification with the 0-1
loss, namely, $\mathcal{Y} = \{0,1\}$ and $\ell(h, (x, y)) = \mathbb{1}_{[h(x) \ne y]}$. We also assume throughout the
analysis that $\mathcal{X} = [0,1]^d$ and $\rho$ is the Euclidean distance.

We start by introducing some notation. Let $\mathcal{D}$ be a distribution over $\mathcal{X} \times \mathcal{Y}$. Let
$\mathcal{D}_{\mathcal{X}}$ denote the induced marginal distribution over $\mathcal{X}$ and let $\eta : \mathbb{R}^d \to \mathbb{R}$ be the
conditional probability$^1$ over the labels, that is,

$$\eta(x) = \mathbb{P}[y = 1 \mid x].$$

Recall that the Bayes optimal rule (that is, the hypothesis that minimizes $L_{\mathcal{D}}(h)$ over
all functions) is

$$h^\star(x) = \mathbb{1}_{[\eta(x) > 1/2]}.$$

We assume that the conditional probability function $\eta$ is $c$-Lipschitz for some
$c > 0$: Namely, for all $x, x' \in \mathcal{X}$, $|\eta(x) - \eta(x')| \le c \, \|x - x'\|$. In other words, this
assumption means that if two vectors are close to each other then their labels are
likely to be the same.

The following lemma applies the Lipschitzness of the conditional probability 
function to upper bound the true error of the 1-NN rule as a function of the expected 
distance between each test instance and its nearest neighbor in the training set. 


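Lemma 19.1. Let $\mathcal{X} = [0,1]^d$, $\mathcal{Y} = \{0,1\}$, and $\mathcal{D}$ be a distribution over $\mathcal{X} \times \mathcal{Y}$
for which the conditional probability function, $\eta$, is a $c$-Lipschitz function. Let
$S = (x_1, y_1), \ldots, (x_m, y_m)$ be an i.i.d. sample and let $h_S$ be its corresponding 1-NN
hypothesis. Let $h^\star$ be the Bayes optimal rule for $\eta$. Then,

$$\mathbb{E}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(h_S)] \le 2 L_{\mathcal{D}}(h^\star) + c \, \mathbb{E}_{S \sim \mathcal{D}^m, \, x \sim \mathcal{D}}\big[\|x - x_{\pi_1(x)}\|\big].$$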


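1 Formally, $\mathbb{P}[y = 1 \mid x] = \lim_{\delta \to 0} \frac{\mathcal{D}(\{(x', 1) : x' \in B(x, \delta)\})}{\mathcal{D}(\{(x', y) : x' \in B(x, \delta), \, y \in \mathcal{Y}\})}$, where $B(x, \delta)$ is a ball of radius $\delta$ centered
around $x$.


Proof. Since $L_{\mathcal{D}}(h_S) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\mathbb{1}_{[h_S(x) \ne y]}]$, we obtain that $\mathbb{E}_S[L_{\mathcal{D}}(h_S)]$ is the
probability to sample a training set $S$ and an additional example $(x, y)$, such that the
label of $\pi_1(x)$ is different from $y$. In other words, we can first sample $m$ unlabeled
examples, $S_x = (x_1, \ldots, x_m)$, according to $\mathcal{D}_{\mathcal{X}}$, and an additional unlabeled example,
$x \sim \mathcal{D}_{\mathcal{X}}$, then find $\pi_1(x)$ to be the nearest neighbor of $x$ in $S_x$, and finally sample
$y \sim \eta(x)$ and $y_{\pi_1(x)} \sim \eta(\pi_1(x))$. It follows that

$$\mathbb{E}_S[L_{\mathcal{D}}(h_S)] = \mathbb{E}_{S_x \sim \mathcal{D}_{\mathcal{X}}^m, \, x \sim \mathcal{D}_{\mathcal{X}}, \, y \sim \eta(x), \, y' \sim \eta(\pi_1(x))}\big[\mathbb{1}_{[y \ne y']}\big]$$
$$= \mathbb{E}_{S_x \sim \mathcal{D}_{\mathcal{X}}^m, \, x \sim \mathcal{D}_{\mathcal{X}}}\Big[\mathbb{P}_{y \sim \eta(x), \, y' \sim \eta(\pi_1(x))}[y \ne y']\Big]. \qquad (19.2)$$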


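We next upper bound $\mathbb{P}_{y \sim \eta(x), \, y' \sim \eta(x')}[y \ne y']$ for any two domain points $x, x'$:

$$\mathbb{P}_{y \sim \eta(x), \, y' \sim \eta(x')}[y \ne y'] = \eta(x')(1 - \eta(x)) + (1 - \eta(x'))\eta(x)$$
$$= (\eta(x) - \eta(x) + \eta(x'))(1 - \eta(x)) + (1 - \eta(x) + \eta(x) - \eta(x'))\eta(x)$$
$$= 2\eta(x)(1 - \eta(x)) + (\eta(x) - \eta(x'))(2\eta(x) - 1).$$

Using $|2\eta(x) - 1| \le 1$ and the assumption that $\eta$ is $c$-Lipschitz, we obtain that the
probability is at most:

$$\mathbb{P}_{y \sim \eta(x), \, y' \sim \eta(x')}[y \ne y'] \le 2\eta(x)(1 - \eta(x)) + c \, \|x - x'\|.$$

Plugging this into Equation (19.2) we conclude that

$$\mathbb{E}_S[L_{\mathcal{D}}(h_S)] \le \mathbb{E}_x[2\eta(x)(1 - \eta(x))] + c \, \mathbb{E}_{S,x}\big[\|x - x_{\pi_1(x)}\|\big].$$

Finally, the error of the Bayes optimal classifier is

$$L_{\mathcal{D}}(h^\star) = \mathbb{E}_x[\min\{\eta(x), 1 - \eta(x)\}] \ge \mathbb{E}_x[\eta(x)(1 - \eta(x))].$$

Combining the preceding two inequalities concludes our proof. □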
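
The algebraic step in the proof can be checked numerically. The short sketch below is only a sanity check with hypothetical function names: it evaluates the disagreement probability of two independent Bernoulli labels and compares it with the decomposed form and with the bound $2\eta(x)(1 - \eta(x)) + |\eta(x) - \eta(x')|$ that follows from $|2\eta(x) - 1| \le 1$.

```python
import math


def disagree(eta_x, eta_xp):
    """P[y != y'] for independent y ~ Bernoulli(eta_x), y' ~ Bernoulli(eta_xp)."""
    return eta_xp * (1 - eta_x) + (1 - eta_xp) * eta_x


def decomposed(eta_x, eta_xp):
    """The same probability written as in the last line of the derivation."""
    return 2 * eta_x * (1 - eta_x) + (eta_x - eta_xp) * (2 * eta_x - 1)


# Check the identity and the resulting upper bound on a grid of probabilities.
grid = [i / 20 for i in range(21)]
identity_holds = all(
    math.isclose(disagree(a, b), decomposed(a, b), abs_tol=1e-12)
    for a in grid for b in grid
)
bound_holds = all(
    disagree(a, b) <= 2 * a * (1 - a) + abs(a - b) + 1e-12
    for a in grid for b in grid
)
```

The Lipschitz assumption then replaces $|\eta(x) - \eta(x')|$ by $c \, \|x - x'\|$.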


The next step is to bound the expected distance between a random x and its 
closest element in S. We first need the following general probability lemma. The 
lemma bounds the probability weight of subsets that are not hit by a random sample, 
as a function of the size of that sample. 


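Lemma 19.2. Let $C_1, \ldots, C_r$ be a collection of subsets of some domain set, $\mathcal{X}$. Let $S$
be a sequence of $m$ points sampled i.i.d. according to some probability distribution, $\mathcal{D}$
over $\mathcal{X}$. Then,

$$\mathbb{E}_{S \sim \mathcal{D}^m}\Bigg[\sum_{i : C_i \cap S = \emptyset} \mathbb{P}[C_i]\Bigg] \le \frac{r}{m e}.$$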


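Proof. From the linearity of expectation, we can rewrite:

$$\mathbb{E}_S\Bigg[\sum_{i : C_i \cap S = \emptyset} \mathbb{P}[C_i]\Bigg] = \sum_{i=1}^{r} \mathbb{P}[C_i] \, \mathbb{E}_S\big[\mathbb{1}_{[C_i \cap S = \emptyset]}\big].$$

Next, for each $i$ we have

$$\mathbb{E}_S\big[\mathbb{1}_{[C_i \cap S = \emptyset]}\big] = \mathbb{P}_S[C_i \cap S = \emptyset] = (1 - \mathbb{P}[C_i])^m \le e^{-\mathbb{P}[C_i] m}.$$

Combining the preceding two equations we get

$$\mathbb{E}_S\Bigg[\sum_{i : C_i \cap S = \emptyset} \mathbb{P}[C_i]\Bigg] \le \sum_{i=1}^{r} \mathbb{P}[C_i] \, e^{-\mathbb{P}[C_i] m} \le r \max_i \mathbb{P}[C_i] \, e^{-\mathbb{P}[C_i] m}.$$

Finally, by a standard calculus, $\max_a a e^{-ma} \le \frac{1}{me}$ and this concludes the proof. □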
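
For completeness, the calculus step used at the end of the proof can be written out as follows; the function name $g$ is introduced here only for the derivation.

```latex
\begin{align*}
g(a) &= a\,e^{-ma}, \qquad a \ge 0, \\
g'(a) &= e^{-ma}\,(1 - ma) = 0 \iff a = \tfrac{1}{m}, \\
\max_{a \ge 0} g(a) &= g\!\left(\tfrac{1}{m}\right) = \tfrac{1}{m}\,e^{-1} = \tfrac{1}{me}.
\end{align*}
```

Since $g'(a) > 0$ for $a < 1/m$ and $g'(a) < 0$ for $a > 1/m$, the stationary point is indeed the maximum.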


Equipped with the preceding lemmas we are now ready to state and prove the 
main result of this section — an upper bound on the expected error of the 1-NN 
learning rule. 


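Theorem 19.3. Let $\mathcal{X} = [0,1]^d$, $\mathcal{Y} = \{0,1\}$, and $\mathcal{D}$ be a distribution over $\mathcal{X} \times \mathcal{Y}$ for
which the conditional probability function, $\eta$, is a $c$-Lipschitz function. Let $h_S$ denote
the result of applying the 1-NN rule to a sample $S \sim \mathcal{D}^m$. Then,

$$\mathbb{E}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(h_S)] \le 2 L_{\mathcal{D}}(h^\star) + 4 c \sqrt{d} \, m^{-\frac{1}{d+1}}.$$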


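Proof. Fix some $\epsilon = 1/T$, for some integer $T$, let $r = T^d$ and let $C_1, \ldots, C_r$ be the
cover of the set $\mathcal{X}$ using boxes of length $\epsilon$: Namely, for every $(\alpha_1, \ldots, \alpha_d) \in [T]^d$,
there exists a set $C_i$ of the form $\{x : \forall j, \ x_j \in [(\alpha_j - 1)/T, \alpha_j/T]\}$. An illustration for
$d = 2$, $T = 5$ and the set corresponding to $\alpha = (2,4)$ is given in the following.

[Figure: illustration of the cover for $d = 2$, $T = 5$, with the box corresponding to $\alpha = (2,4)$ marked.]

For each $x, x'$ in the same box we have $\|x - x'\| \le \sqrt{d} \, \epsilon$. Otherwise, $\|x - x'\| \le \sqrt{d}$.
Therefore,

$$\mathbb{E}_{x,S}\big[\|x - x_{\pi_1(x)}\|\big] \le \mathbb{E}_S\Bigg[\mathbb{P}\Bigg[\bigcup_{i : C_i \cap S = \emptyset} C_i\Bigg] \sqrt{d} + \mathbb{P}\Bigg[\bigcup_{i : C_i \cap S \ne \emptyset} C_i\Bigg] \epsilon \sqrt{d}\Bigg],$$

and by combining Lemma 19.2 with the trivial bound $\mathbb{P}\big[\bigcup_{i : C_i \cap S \ne \emptyset} C_i\big] \le 1$ we get that

$$\mathbb{E}_{x,S}\big[\|x - x_{\pi_1(x)}\|\big] \le \sqrt{d}\left(\frac{r}{me} + \epsilon\right).$$

Since the number of boxes is $r = (1/\epsilon)^d$ we get that

$$\mathbb{E}_{S,x}\big[\|x - x_{\pi_1(x)}\|\big] \le \sqrt{d}\left(\frac{2^d \epsilon^{-d}}{me} + \epsilon\right).$$

Combining the preceding with Lemma 19.1 we obtain that

$$\mathbb{E}_S[L_{\mathcal{D}}(h_S)] \le 2 L_{\mathcal{D}}(h^\star) + c \sqrt{d}\left(\frac{2^d \epsilon^{-d}}{me} + \epsilon\right).$$

Finally, setting $\epsilon = 2 m^{-1/(d+1)}$ and noting that

$$\frac{2^d \epsilon^{-d}}{me} + \epsilon = \frac{2^d \, 2^{-d} \, m^{d/(d+1)}}{me} + 2 m^{-1/(d+1)} = m^{-1/(d+1)}(1/e + 2) \le 4 m^{-1/(d+1)}$$

we conclude our proof. □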


The theorem implies that if we first fix the data-generating distribution and
then let $m$ go to infinity, then the error of the 1-NN rule converges to twice the
Bayes error. The analysis can be generalized to larger values of $k$, showing that
the expected error of the k-NN rule converges to $(1 + \sqrt{8/k})$ times the error of the
Bayes classifier. This is formalized in Theorem 19.5, whose proof is left as a guided
exercise.
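
To get a feel for the bound in Theorem 19.3, the following sketch evaluates its last term, $4c\sqrt{d} \, m^{-1/(d+1)}$, and the sample size obtained by solving $4c\sqrt{d} \, m^{-1/(d+1)} \le \epsilon$ for $m$, which gives $m \ge (4c\sqrt{d}/\epsilon)^{d+1}$. The function names are hypothetical, and the numbers describe the upper bound only, not the actual error of 1-NN on a given distribution.

```python
import math


def nn_excess_bound(m, d, c):
    """Last term of the bound in Theorem 19.3: 4 c sqrt(d) m^(-1/(d+1))."""
    return 4 * c * math.sqrt(d) * m ** (-1.0 / (d + 1))


def samples_needed(eps, d, c):
    """Smallest integer m for which the last term of the bound is at most eps."""
    return math.ceil((4 * c * math.sqrt(d) / eps) ** (d + 1))


# Usage: how the required sample size grows with the dimension d (c = 1, eps = 0.5).
for d in (1, 2, 5, 10):
    print(d, samples_needed(0.5, d, 1.0))
```

The exponent $d + 1$ makes the required sample size grow very quickly with the dimension.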


19.2.2 The “Curse of Dimensionality” 

The upper bound given in Theorem 19.3 grows with $c$ (the Lipschitz coefficient of $\eta$)
and with $d$, the Euclidean dimension of the domain set $\mathcal{X}$. In fact, it is easy to see that
a necessary condition for the last term in Theorem 19.3 to be smaller than $\epsilon$ is that
$m \ge (4c\sqrt{d}/\epsilon)^{d+1}$. That is, the size of the training set should increase exponentially
with the dimension. The following theorem tells us that this is not just an artifact
of our upper bound, but, for some distributions, this amount of examples is indeed
necessary for learning with the NN rule.


Theorem 19.4. For any $c > 1$, and every learning rule, $L$, there exists a distribution
over $[0,1]^d \times \{0,1\}$, such that $\eta(x)$ is $c$-Lipschitz, the Bayes error of the distribution is
0, but for sample sizes $m < (c+1)^d/2$, the true error of the rule $L$ is greater than 1/4.


Proof. Fix any values of $c$ and $d$. Let $G_c^d$ be the grid on $[0,1]^d$ with distance of
$1/c$ between points on the grid. That is, each point on the grid is of the form
$(a_1/c, \ldots, a_d/c)$ where $a_i$ is in $\{0, \ldots, c-1, c\}$. Note that, since any two distinct points
on this grid are at least $1/c$ apart, any function $\eta : G_c^d \to [0,1]$ is a $c$-Lipschitz
function. It follows that the set of all $c$-Lipschitz functions over $G_c^d$ contains the
set of all binary valued functions over that domain. We can therefore invoke the
No-Free-Lunch result (Theorem 5.1) to obtain a lower bound on the needed
sample sizes for learning that class. The number of points on the grid is $(c+1)^d$; hence,
if $m < (c+1)^d/2$, Theorem 5.1 implies the lower bound we are after. □


The exponential dependence on the dimension is known as the curse of
dimensionality. As we saw, the 1-NN rule might fail if the number of examples is smaller
than $\Omega((c+1)^d)$. Therefore, while the 1-NN rule does not restrict itself to a
predefined set of hypotheses, it still relies on some prior knowledge: its success depends
on the assumption that the dimension and the Lipschitz constant of the underlying
distribution, $\eta$, are not too high.


19.3 EFFICIENT IMPLEMENTATION*


Nearest Neighbor is a learning-by-memorization type of rule. It requires the entire 
training data set to be stored, and at test time, we need to scan the entire data set in 
order to find the neighbors. The time of applying the NN rule is therefore O(dm). 
This leads to expensive computation at test time. 

When $d$ is small, several results from the field of computational geometry have
proposed data structures that make it possible to apply the NN rule in time
$o(d^{O(1)} \log(m))$. However, the space required by these data structures is roughly
$m^{O(d)}$, which makes these methods impractical for larger values of $d$.

To overcome this problem, it was suggested to improve the search method by 
allowing an approximate search. Formally, an r-approximate search procedure is 
guaranteed to retrieve a point within distance of at most r times the distance to the 
nearest neighbor. Three popular approximate algorithms for NN are the kd-tree, 
balltrees, and locality-sensitive hashing (LSH). We refer the reader, for example, to 
(Shakhnarovich, Darrell & Indyk 2006). 


19.4 SUMMARY 


The k-NN rule is a very simple learning algorithm that relies on the assumption 
that “things that look alike must be alike.” We formalized this intuition using the 
Lipschitzness of the conditional probability. We have shown that with a sufficiently 
large training set, the risk of the 1-NN is upper bounded by twice the risk of the 
Bayes optimal rule. We have also derived a lower bound that shows the “curse of 
dimensionality” — the required sample size might increase exponentially with the 
dimension. As a result, NN is usually performed in practice after a dimensionality 
reduction preprocessing step. We discuss dimensionality reduction techniques later 
on in Chapter 23. 


19.5 BIBLIOGRAPHIC REMARKS 


Cover and Hart (1967) gave the first analysis of 1-NN, showing that its risk
converges to twice the Bayes optimal error under mild conditions. Following a lemma
due to Stone (1977), Devroye and Györfi (1985) have shown that the k-NN rule is
consistent (with respect to the hypothesis class of all functions from $\mathbb{R}^d$ to $\{0,1\}$).
A good presentation of the analysis is given in the book by Devroye et al. (1996).
Here, we give a finite sample guarantee that explicitly underscores the prior
assumption on the distribution. See Section 7.4 for a discussion on consistency results.
Finally, Gottlieb, Kontorovich, and Krauthgamer (2010) derived another finite
sample bound for NN that is more similar to VC bounds.


19.6 EXERCISES 
In this exercise we will prove the following theorem for the k-NN rule. 


Theorem 19.5. Let $\mathcal{X} = [0,1]^d$, $\mathcal{Y} = \{0,1\}$, and $\mathcal{D}$ be a distribution over $\mathcal{X} \times \mathcal{Y}$ for
which the conditional probability function, $\eta$, is a $c$-Lipschitz function. Let $h_S$ denote
the result of applying the k-NN rule to a sample $S \sim \mathcal{D}^m$, where $k \ge 10$. Let $h^\star$ be the
Bayes optimal hypothesis. Then,

$$\mathbb{E}_S[L_{\mathcal{D}}(h_S)] \le \left(1 + \sqrt{\frac{8}{k}}\right) L_{\mathcal{D}}(h^\star) + \left(6 c \sqrt{d} + k\right) m^{-1/(d+1)}.$$


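19.1 Prove the following lemma.

Lemma 19.6. Let $C_1, \ldots, C_r$ be a collection of subsets of some domain set, $\mathcal{X}$. Let $S$
be a sequence of $m$ points sampled i.i.d. according to some probability distribution,
$\mathcal{D}$ over $\mathcal{X}$. Then, for every $k \ge 2$,

$$\mathbb{E}_{S \sim \mathcal{D}^m}\Bigg[\sum_{i : |C_i \cap S| < k} \mathbb{P}[C_i]\Bigg] \le \frac{2rk}{m}.$$

Hints:
- Show that

$$\mathbb{E}_S\Bigg[\sum_{i : |C_i \cap S| < k} \mathbb{P}[C_i]\Bigg] = \sum_{i=1}^{r} \mathbb{P}[C_i] \, \mathbb{P}_S\big[|C_i \cap S| < k\big].$$

- Fix some $i$ and suppose that $k < \mathbb{P}[C_i] m / 2$. Use Chernoff’s bound to show that

$$\mathbb{P}_S\big[|C_i \cap S| < k\big] \le \mathbb{P}_S\big[|C_i \cap S| < \mathbb{P}[C_i] m / 2\big] \le e^{-\mathbb{P}[C_i] m / 8}.$$

- Use the inequality $\max_a a e^{-ma} \le \frac{1}{me}$ to show that for such $i$ we have

$$\mathbb{P}[C_i] \, \mathbb{P}_S\big[|C_i \cap S| < k\big] \le \mathbb{P}[C_i] \, e^{-\mathbb{P}[C_i] m / 8} \le \frac{8}{me}.$$

- Conclude the proof by using the fact that for the case $k \ge \mathbb{P}[C_i] m / 2$ we clearly
have:

$$\mathbb{P}[C_i] \, \mathbb{P}_S\big[|C_i \cap S| < k\big] \le \mathbb{P}[C_i] \le \frac{2k}{m}.$$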
19.2 We use the notation $y \sim p$ as a shorthand for “$y$ is a Bernoulli random variable
with expected value $p$.” Prove the following lemma:

Lemma 19.7. Let $k \ge 10$ and let $Z_1, \ldots, Z_k$ be independent Bernoulli random
variables with $\mathbb{P}[Z_i = 1] = p_i$. Denote $p = \frac{1}{k} \sum_i p_i$ and $p' = \frac{1}{k} \sum_{i=1}^{k} Z_i$. Show that

$$\mathbb{E}_{Z_1, \ldots, Z_k} \, \mathbb{P}_{y \sim p}\big[y \ne \mathbb{1}_{[p' > 1/2]}\big] \le \left(1 + \sqrt{\frac{8}{k}}\right) \mathbb{P}_{y \sim p}\big[y \ne \mathbb{1}_{[p > 1/2]}\big].$$

Hints:
W.l.o.g. assume that $p \le 1/2$. Then, $\mathbb{P}_{y \sim p}[y \ne \mathbb{1}_{[p > 1/2]}] = p$. Let $y' = \mathbb{1}_{[p' > 1/2]}$.
- Show that

$$\mathbb{E}_{Z_1, \ldots, Z_k} \, \mathbb{P}_{y \sim p}[y \ne y'] - p = \mathbb{P}_{Z_1, \ldots, Z_k}[p' > 1/2] \, (1 - 2p).$$

- Use Chernoff’s bound (Lemma B.3) to show that

$$\mathbb{P}[p' > 1/2] \le e^{-k \, p \, h\left(\frac{1}{2p} - 1\right)},$$

where

$$h(a) = (1 + a)\log(1 + a) - a.$$


An artificial neural network is a model of computation inspired by the structure of
neural networks in the brain. In simplified models of the brain, it consists of a large
number of basic computing devices (neurons) that are connected to each other in
a complex communication network, through which the brain is able to carry out
highly complex computations. Artificial neural networks are formal computation
constructs that are modeled after this computation paradigm.

Learning with neural networks was proposed in the mid-20th century. It yields 
an effective learning paradigm and has recently been shown to achieve cutting-edge 
performance on several learning tasks. 

A neural network can be described as a directed graph whose nodes correspond 
to neurons and edges correspond to links between them. Each neuron receives as 
input a weighted sum of the outputs of the neurons connected to its incoming edges. 
We focus on feedforward networks in which the underlying graph does not contain 
cycles. 

In the context of learning, we can define a hypothesis class consisting of neural
network predictors, where all the hypotheses share the underlying graph structure
of the network and differ in the weights over edges. As we will show in Section 20.3,
every predictor over $n$ variables that can be implemented in time $T(n)$ can also be
expressed as a neural network predictor of size $O(T(n)^2)$, where the size of the
network is the number of nodes in it. It follows that the family of hypothesis classes
of neural networks of polynomial size can suffice for all practical learning tasks, in
which our goal is to learn predictors which can be implemented efficiently.
Furthermore, in Section 20.4 we will show that the sample complexity of learning such
hypothesis classes is also bounded in terms of the size of the network. Hence, it
seems that this is the ultimate learning paradigm we would want to adopt, in the
sense that it both has a polynomial sample complexity and has the minimal
approximation error among all hypothesis classes consisting of efficiently implementable
predictors.

The caveat is that the problem of training such hypothesis classes of neural net- 
work predictors is computationally hard. This will be formalized in Section 20.5. 




A widely used heuristic for training neural networks relies on the SGD frame- 
work we studied in Chapter 14. There, we have shown that SGD is a successful 
learner if the loss function is convex. In neural networks, the loss function is highly 
nonconvex. Nevertheless, we can still implement the SGD algorithm and hope it will 
find a reasonable solution (as happens to be the case in several practical tasks). In 
Section 20.6 we describe how to implement SGD for neural networks. In particular, 
the most complicated operation is the calculation of the gradient of the loss func- 
tion with respect to the parameters of the network. We present the backpropagation 
algorithm that efficiently calculates the gradient. 


20.1 FEEDFORWARD NEURAL NETWORKS 


The idea behind neural networks is that many neurons can be joined together by 
communication links to carry out complex computations. It is common to describe 
the structure of a neural network as a graph whose nodes are the neurons and each 
(directed) edge in the graph links the output of some neuron to the input of another 
neuron. We will restrict our attention to feedforward network structures in which 
the underlying graph does not contain cycles. 

A feedforward neural network is described by a directed acyclic graph, $G =
(V, E)$, and a weight function over the edges, $w : E \to \mathbb{R}$. Nodes of the graph
correspond to neurons. Each single neuron is modeled as a simple scalar function,
$\sigma : \mathbb{R} \to \mathbb{R}$. We will focus on three possible functions for $\sigma$: the sign function, $\sigma(a) =
\mathrm{sign}(a)$, the threshold function, $\sigma(a) = \mathbb{1}_{[a > 0]}$, and the sigmoid function, $\sigma(a) =
1/(1 + \exp(-a))$, which is a smooth approximation to the threshold function. We
call $\sigma$ the “activation” function of the neuron. Each edge in the graph links the
output of some neuron to the input of another neuron. The input of a neuron is
obtained by taking a weighted sum of the outputs of all the neurons connected to it,
where the weighting is according to $w$.

To simplify the description of the calculation performed by the network, we
further assume that the network is organized in layers. That is, the set of nodes can
be decomposed into a union of (nonempty) disjoint subsets, $V = \dot{\bigcup}_{t=0}^{T} V_t$, such that
every edge in $E$ connects some node in $V_{t-1}$ to some node in $V_t$, for some $t \in [T]$.
The bottom layer, $V_0$, is called the input layer. It contains $n + 1$ neurons, where $n$ is
the dimensionality of the input space. For every $i \in [n]$, the output of neuron $i$ in $V_0$
is simply $x_i$. The last neuron in $V_0$ is the “constant” neuron, which always outputs 1.
We denote by $v_{t,i}$ the $i$th neuron of the $t$th layer and by $o_{t,i}(x)$ the output of $v_{t,i}$ when
the network is fed with the input vector $x$. Therefore, for $i \in [n]$ we have $o_{0,i}(x) = x_i$
and for $i = n + 1$ we have $o_{0,i}(x) = 1$. We now proceed with the calculation in a layer
by layer manner. Suppose we have calculated the outputs of the neurons at layer $t$.
Then, we can calculate the outputs of the neurons at layer $t + 1$ as follows. Fix some
$v_{t+1,j} \in V_{t+1}$. Let $a_{t+1,j}(x)$ denote the input to $v_{t+1,j}$ when the network is fed with
the input vector $x$. Then,

$$a_{t+1,j}(x) = \sum_{r : (v_{t,r}, v_{t+1,j}) \in E} w\big((v_{t,r}, v_{t+1,j})\big) \, o_{t,r}(x),$$

and

$$o_{t+1,j}(x) = \sigma\big(a_{t+1,j}(x)\big).$$

That is, the input to $v_{t+1,j}$ is a weighted sum of the outputs of the neurons in $V_t$ that
are connected to $v_{t+1,j}$, where weighting is according to $w$, and the output of $v_{t+1,j}$
is simply the application of the activation function $\sigma$ on its input.
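
The layer-by-layer calculation can be sketched as follows. This is a minimal illustration with hypothetical names and a hypothetical encoding: `weights[t][j][r]` stores $w((v_{t,r}, v_{t+1,j}))$, and a missing edge is encoded by a weight of 0, so a neuron with no incoming edges has an all-zero row and outputs $\sigma(0)$.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def threshold(a):
    return 1.0 if a > 0 else 0.0


def forward(weights, x, sigma=sigmoid):
    """Compute the outputs of the top layer for the input vector x."""
    o = list(x) + [1.0]          # layer V_0: the n inputs followed by the constant neuron
    for W in weights:            # W holds the weights on the edges from V_t to V_{t+1}
        a = [sum(w * o_r for w, o_r in zip(row, o)) for row in W]  # inputs a_{t+1,j}(x)
        o = [sigma(a_j) for a_j in a]                              # outputs o_{t+1,j}(x)
    return o


# Usage: a depth-2 network with threshold activations that computes XOR.
xor_weights = [
    [[1.0, 1.0, -0.5],   # hidden neuron 1 fires when x1 OR x2
     [1.0, 1.0, -1.5]],  # hidden neuron 2 fires when x1 AND x2
    [[1.0, -1.0]],       # output fires when OR holds but AND does not
]
outputs = [forward(xor_weights, x, sigma=threshold)[0]
           for x in ([0, 0], [0, 1], [1, 0], [1, 1])]
```

Storing each layer as a dense list of rows is a design choice made for brevity; a sparse edge list would follow the graph description more literally.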

Layers $V_1, \ldots, V_{T-1}$ are often called hidden layers. The top layer, $V_T$, is called
the output layer. In simple prediction problems the output layer contains a single
neuron whose output is the output of the network.

We refer to $T$ as the number of layers in the network (excluding $V_0$), or the
“depth” of the network. The size of the network is $|V|$. The “width” of the network
is $\max_t |V_t|$. An illustration of a layered feedforward neural network of depth 2, size
10, and width 5, is given in the following. Note that there is a neuron in the hidden
layer that has no incoming edges. This neuron will output the constant $\sigma(0)$.

[Figure: a layered feedforward network with an input layer ($V_0$), a hidden layer ($V_1$), and an output layer ($V_2$).]


20.2 LEARNING NEURAL NETWORKS 


Once we have specified a neural network by $(V, E, \sigma, w)$, we obtain a function
$h_{V,E,\sigma,w} : \mathbb{R}^{|V_0| - 1} \to \mathbb{R}^{|V_T|}$. Any set of such functions can serve as a hypothesis class
for learning. Usually, we define a hypothesis class of neural network predictors by
fixing the graph $(V, E)$ as well as the activation function $\sigma$ and letting the hypothesis
class be all functions of the form $h_{V,E,\sigma,w}$ for some $w : E \to \mathbb{R}$. The triplet $(V, E, \sigma)$
is often called the architecture of the network. We denote the hypothesis class by

$$\mathcal{H}_{V,E,\sigma} = \{h_{V,E,\sigma,w} : w \text{ is a mapping from } E \text{ to } \mathbb{R}\}. \qquad (20.1)$$

That is, the parameters specifying a hypothesis in the hypothesis class are the
weights over the edges of the network.

We can now study the approximation error, estimation error, and optimization
error of such hypothesis classes. In Section 20.3 we study the approximation error
of $\mathcal{H}_{V,E,\sigma}$ by studying what type of functions hypotheses in $\mathcal{H}_{V,E,\sigma}$ can implement,
in terms of the size of the underlying graph. In Section 20.4 we study the estimation
error of $\mathcal{H}_{V,E,\sigma}$, for the case of binary classification (i.e., $|V_T| = 1$ and $\sigma$ is the sign
function), by analyzing its VC dimension. Finally, in Section 20.5 we show that it
is computationally hard to learn the class $\mathcal{H}_{V,E,\sigma}$, even if the underlying graph is
small, and in Section 20.6 we present the most commonly used heuristic for training
$\mathcal{H}_{V,E,\sigma}$.


20.3 THE EXPRESSIVE POWER OF NEURAL NETWORKS


In this section we study the expressive power of neural networks, namely, what type
of functions can be implemented using a neural network. More concretely, we will
fix some architecture, $V, E, \sigma$, and will study what functions hypotheses in $\mathcal{H}_{V,E,\sigma}$
can implement, as a function of the size of $V$.

We start the discussion with studying which type of Boolean functions (i.e.,
functions from $\{\pm 1\}^n$ to $\{\pm 1\}$) can be implemented by $\mathcal{H}_{V,E,\mathrm{sign}}$. Observe that for
every computer in which real numbers are stored using $b$ bits, whenever we
calculate a function $f : \mathbb{R}^n \to \mathbb{R}$ on such a computer we in fact calculate a function
$g : \{\pm 1\}^{nb} \to \{\pm 1\}^b$. Therefore, studying which Boolean functions can be
implemented by $\mathcal{H}_{V,E,\mathrm{sign}}$ can tell us which functions can be implemented on a computer
that stores real numbers using $b$ bits.

We begin with a simple claim, showing that without restricting the size of the 
network, every Boolean function can be implemented using a neural network of 
depth 2. 


Claim 20.1. For every $n$, there exists a graph $(V, E)$ of depth 2, such that $\mathcal{H}_{V,E,\mathrm{sign}}$
contains all functions from $\{\pm 1\}^n$ to $\{\pm 1\}$.


Proof. We construct a graph with $|V_0| = n + 1$, $|V_1| = 2^n + 1$, and $|V_2| = 1$. Let $E$ be all
possible edges between adjacent layers. Now, let $f : \{\pm 1\}^n \to \{\pm 1\}$ be some Boolean
function. We need to show that we can adjust the weights so that the network will
implement $f$. Let $u_1, \ldots, u_k$ be all vectors in $\{\pm 1\}^n$ on which $f$ outputs 1. Observe
that for every $i$ and every $x \in \{\pm 1\}^n$, if $x \ne u_i$ then $\langle x, u_i \rangle \le n - 2$ and if $x = u_i$ then
$\langle x, u_i \rangle = n$. It follows that the function $g_i(x) = \mathrm{sign}(\langle x, u_i \rangle - n + 1)$ equals 1 if and
only if $x = u_i$. It follows that we can adapt the weights between $V_0$ and $V_1$ so that for
every $i \in [k]$, the neuron $v_{1,i}$ implements the function $g_i(x)$. Next, we observe that
$f(x)$ is the disjunction of the functions $g_i(x)$, and therefore can be written as

$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{k} g_i(x) + k - 1\right),$$

which concludes our proof. □
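
The construction in the proof can be sketched directly. The function names below are hypothetical, and the sketch stores only the $k$ hidden neurons that are actually used rather than all $2^n + 1$ of them; the bias terms $-n + 1$ and $k - 1$ correspond to weights on edges leaving the constant neurons.

```python
from itertools import product


def sign(a):
    return 1 if a > 0 else -1


def hidden_units(f, n):
    """The vectors u_1, ..., u_k in {-1, +1}^n on which f outputs 1."""
    return [u for u in product((-1, 1), repeat=n) if f(u) == 1]


def evaluate(us, x):
    """Output of the depth-2 sign network built from the vectors us."""
    n, k = len(x), len(us)
    # Hidden neuron i computes g_i(x) = sign(<x, u_i> - n + 1), which is 1 iff x == u_i.
    g = [sign(sum(a * b for a, b in zip(x, u)) - n + 1) for u in us]
    # The output neuron computes the disjunction sign(sum_i g_i(x) + k - 1).
    return sign(sum(g) + k - 1)


def parity(x):
    p = 1
    for a in x:
        p *= a
    return p


# Usage: implement the parity (product) function on 3 inputs.
us = hidden_units(parity, 3)
agrees = all(evaluate(us, x) == parity(x) for x in product((-1, 1), repeat=3))
```

For parity, half of the $2^n$ inputs are mapped to 1, so the number of hidden neurons used by this construction already grows exponentially with $n$.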


The preceding claim shows that neural networks can implement any Boolean 
function. However, this is a very weak property, as the size of the resulting network 
might be exponentially large. In the construction given at the proof of Claim 20.1, 
the number of nodes in the hidden layer is exponentially large. This is not an artifact 
of our proof, as stated in the following theorem. 


Theorem 20.2. For every n, let s(n) be the minimal integer such that there exists a 
graph (V,E) with |V| = s(n) such that the hypothesis class Hy ,£,sign contains all the 
functions from {0,1}" to {0,1}. Then, s(n) is exponential in n. Similar results hold for 
Hy _.E.c Where o is the sigmoid function. 


Proof. Suppose that for some $(V, E)$ we have that $\mathcal{H}_{V,E,\mathrm{sign}}$ contains all functions
from $\{0,1\}^n$ to $\{0,1\}$. It follows that it can shatter the set of $m = 2^n$ vectors in $\{0,1\}^n$
and hence the VC dimension of $\mathcal{H}_{V,E,\mathrm{sign}}$ is $2^n$. On the other hand, the VC dimension
of $\mathcal{H}_{V,E,\mathrm{sign}}$ is bounded by $O(|E|\log(|E|)) \le O(|V|^3)$, as we will show in the
next section. This implies that $|V| \ge \Omega(2^{n/3})$, which concludes our proof for the
case of networks with the sign activation function. The proof for the sigmoid case is
analogous. □


Remark 20.1. It is possible to derive a similar theorem for $\mathcal{H}_{V,E,\sigma}$ for any $\sigma$, as long
as we restrict the weights so that it is possible to express every weight using a number 
of bits which is bounded by a universal constant. We can even consider hypothesis 
classes where different neurons can employ different activation functions, as long as 
the number of allowed activation functions is also finite. 


Which functions can we express using a network of polynomial size? The preceding
claim tells us that it is impossible to express all Boolean functions using a
network of polynomial size. On the positive side, in the following we show that all
Boolean functions that can be calculated in time $O(T(n))$ can also be expressed by
a network of size $O(T(n)^2)$.


Theorem 20.3. Let $T : \mathbb{N} \to \mathbb{N}$ and for every $n$, let $\mathcal{F}_n$ be the set of functions that can
be implemented using a Turing machine using runtime of at most $T(n)$. Then, there
exist constants $b, c \in \mathbb{R}_+$ such that for every $n$, there is a graph $(V_n, E_n)$ of size at most
$c\,T(n)^2 + b$ such that $\mathcal{H}_{V_n,E_n,\mathrm{sign}}$ contains $\mathcal{F}_n$.


The proof of this theorem relies on the relation between the time complexity 
of programs and their circuit complexity (see, for example, Sipser (2006)). In a 
nutshell, a Boolean circuit is a type of network in which the individual neurons 
implement conjunctions, disjunctions, and negation of their inputs. Circuit com- 
plexity measures the size of Boolean circuits required to calculate functions. The 
relation between time complexity and circuit complexity can be seen intuitively as 
follows. We can model each step of the execution of a computer program as a simple 
operation on its memory state. Therefore, the neurons at each layer of the network 
will reflect the memory state of the computer at the corresponding time, and the 
translation to the next layer of the network involves a simple calculation that can 
be carried out by the network. To relate Boolean circuits to networks with the sign 
activation function, we need to show that we can implement the operations of con- 
junction, disjunction, and negation, using the sign activation function. Clearly, we 
can implement the negation operator using the sign activation function. The follow- 
ing lemma shows that the sign activation function can also implement conjunctions 
and disjunctions of its inputs. 


Lemma 20.4. Suppose that a neuron v, that implements the sign activation function, 
has $k$ incoming edges, connecting it to neurons whose outputs are in $\{\pm 1\}$. Then, by
adding one more edge, linking a “constant” neuron to v, and by adjusting the weights 
on the edges to v, the output of v can implement the conjunction or the disjunction of 
its inputs. 


Proof. Simply observe that if $f : \{\pm 1\}^k \to \{\pm 1\}$ is the conjunction function,
$f(x) = \wedge_i x_i$, then it can be written as $f(x) = \mathrm{sign}\left(1 - k + \sum_{i=1}^{k} x_i\right)$.
Similarly, the disjunction function, $f(x) = \vee_i x_i$, can be written as
$f(x) = \mathrm{sign}\left(k - 1 + \sum_{i=1}^{k} x_i\right)$. □
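
Both formulas can be verified by enumeration. The sketch below is illustrative only (the function names are hypothetical): each neuron has weight 1 on every input, plus a weight of $1-k$ or $k-1$ on the constant neuron, and the check runs over all of $\{\pm 1\}^k$ for one small $k$.

```python
from itertools import product

def sign(z):
    # Sign activation; the arguments below are always odd integers, never 0.
    return 1 if z > 0 else -1

def conjunction(x):
    # The constant neuron carries weight (1 - k); each input has weight 1.
    return sign(1 - len(x) + sum(x))

def disjunction(x):
    # The constant neuron carries weight (k - 1); each input has weight 1.
    return sign(len(x) - 1 + sum(x))

k = 4
for x in product((-1, 1), repeat=k):
    assert conjunction(x) == (1 if all(xi == 1 for xi in x) else -1)
    assert disjunction(x) == (1 if any(xi == 1 for xi in x) else -1)
```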


20.4 THE SAMPLE COMPLEXITY OF NEURAL NETWORKS


Next we discuss the sample complexity of learning the class $\mathcal{H}_{V,E,\sigma}$. Recall that the
fundamental theorem of learning tells us that the sample complexity of learning a
hypothesis class of binary classifiers depends on its VC dimension. Therefore, we
focus on calculating the VC dimension of hypothesis classes of the form $\mathcal{H}_{V,E,\sigma}$,
where the output layer of the graph contains a single neuron.

We start with the sign activation function, namely, with $\mathcal{H}_{V,E,\mathrm{sign}}$. What is the VC
dimension of this class? Intuitively, since we learn $|E|$ parameters, the VC dimension
should be of order $|E|$. This is indeed the case, as formalized by the following
theorem.


Theorem 20.6. The VC dimension of $\mathcal{H}_{V,E,\mathrm{sign}}$ is $O(|E| \log(|E|))$.


Proof. To simplify the notation throughout the proof, let us denote the hypothesis
class by $\mathcal{H}$. Recall the definition of the growth function, $\tau_{\mathcal{H}}(m)$, from Section 6.5.1.
This function measures $\max_{C \subset \mathcal{X} : |C| = m} |\mathcal{H}_C|$, where $\mathcal{H}_C$ is the restriction of $\mathcal{H}$ to
functions from $C$ to $\{0,1\}$. We can naturally extend the definition for a set of functions
from $\mathcal{X}$ to some finite set $\mathcal{Y}$, by letting $\mathcal{H}_C$ be the restriction of $\mathcal{H}$ to functions from
$C$ to $\mathcal{Y}$, and keeping the definition of $\tau_{\mathcal{H}}(m)$ intact.

Our neural network is defined by a layered graph. Let $V_0,\ldots,V_T$ be the layers
of the graph. Fix some $t \in [T]$. By assigning different weights on the edges between
$V_{t-1}$ and $V_t$, we obtain different functions from $\mathbb{R}^{|V_{t-1}|} \to \{\pm 1\}^{|V_t|}$. Let $\mathcal{H}^{(t)}$ be the
class of all possible such mappings from $\mathbb{R}^{|V_{t-1}|} \to \{\pm 1\}^{|V_t|}$. Then, $\mathcal{H}$ can be written
as a composition, $\mathcal{H} = \mathcal{H}^{(T)} \circ \cdots \circ \mathcal{H}^{(1)}$. In Exercise 20.4 we show that the growth
function of a composition of hypothesis classes is bounded by the products of the
growth functions of the individual classes. Therefore,

$$
\tau_{\mathcal{H}}(m) \le \prod_{t=1}^{T} \tau_{\mathcal{H}^{(t)}}(m).
$$

In addition, each $\mathcal{H}^{(t)}$ can be written as a product of function classes,
$\mathcal{H}^{(t)} = \mathcal{H}^{(t,1)} \times \cdots \times \mathcal{H}^{(t,|V_t|)}$, where each $\mathcal{H}^{(t,j)}$ is all functions from layer
$t-1$ to $\{\pm 1\}$ that the $j$th neuron of layer $t$ can implement. In Exercise 20.3 we
bound product classes, and this yields

$$
\tau_{\mathcal{H}^{(t)}}(m) \le \prod_{i=1}^{|V_t|} \tau_{\mathcal{H}^{(t,i)}}(m).
$$

Let $d_{t,i}$ be the number of edges that are headed to the $i$th neuron of layer $t$.
Since the neuron is a homogenous halfspace hypothesis and the VC dimension of
homogenous halfspaces is the dimension of their input, we have by Sauer's lemma
that

$$
\tau_{\mathcal{H}^{(t,i)}}(m) \le \left(\frac{em}{d_{t,i}}\right)^{d_{t,i}} \le (em)^{d_{t,i}}.
$$

Overall, we obtained that

$$
\tau_{\mathcal{H}}(m) \le (em)^{\sum_{t,i} d_{t,i}} = (em)^{|E|}.
$$

Now, assume that there are $m$ shattered points. Then, we must have $\tau_{\mathcal{H}}(m) = 2^m$,
from which we obtain

$$
2^m \le (em)^{|E|} \;\Rightarrow\; m \le |E| \log(em)/\log(2).
$$

The claim follows by Lemma A.2. □
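
To get a feel for the final inequality, one can search numerically for the largest $m$ that still satisfies $2^m \le (em)^{|E|}$. This is a rough sketch under the assumption $|E| \ge 1$; the function names are hypothetical and the edge counts in the loop are arbitrary examples. The comparison is done in log form, $m \le |E| \log_2(em)$, to avoid huge integers.

```python
import math

def satisfies(m, num_edges):
    # Necessary condition for shattering m points: 2^m <= (e*m)^|E|, in log form.
    return m <= num_edges * math.log2(math.e * m)

def max_shattered_upper_bound(num_edges):
    # Largest integer m with 2^m <= (e*m)^|E|. The condition holds at m = 1 and,
    # since m - |E| log2(e m) is convex in m, it holds exactly on an interval [1, m*].
    hi = 1
    while satisfies(hi, num_edges):
        hi *= 2
    lo = hi // 2                      # satisfies(lo) is True, satisfies(hi) is False
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if satisfies(mid, num_edges):
            lo = mid
        else:
            hi = mid
    return lo

for num_edges in (10, 100, 1000):
    print(num_edges, max_shattered_upper_bound(num_edges))
```

The printed values grow slightly faster than linearly in $|E|$, in line with the $O(|E| \log(|E|))$ statement of the theorem.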


Next, we consider $\mathcal{H}_{V,E,\sigma}$, where $\sigma$ is the sigmoid function. Surprisingly, it turns
out that the VC dimension of $\mathcal{H}_{V,E,\sigma}$ is lower bounded by $\Omega(|E|^2)$ (see Exercise
20.5). That is, the VC dimension can be of the order of the number of tunable
parameters squared. It is also possible to upper bound the VC dimension by
$O(|V|^2 |E|^2)$, but the proof is beyond the scope of this book. In any case, since in
practice we only consider networks in which the weights have a short representation
as floating point numbers with $O(1)$ bits, by using the discretization trick we easily
obtain that such networks have a VC dimension of $O(|E|)$, even if we use the sigmoid
activation function.


20.5 THE RUNTIME OF LEARNING NEURAL NETWORKS 


In the previous sections we have shown that the class of neural networks with an 
underlying graph of polynomial size can express all functions that can be imple- 
mented efficiently, and that the sample complexity has a favorable dependence on 
the size of the network. In this section we turn to the analysis of the time complexity 
of training neural networks. 

We first show that it is NP hard to implement the ERM rule with respect to
$\mathcal{H}_{V,E,\mathrm{sign}}$ even for networks with a single hidden layer that contain just 4 neurons in
the hidden layer.


Theorem 20.7. Let $k \ge 3$. For every $n$, let $(V,E)$ be a layered graph with $n$ input
nodes, $k+1$ nodes at the (single) hidden layer, where one of them is the constant
neuron, and a single output node. Then, it is NP hard to implement the ERM rule
with respect to $\mathcal{H}_{V,E,\mathrm{sign}}$.


The proof relies on a reduction from the k-coloring problem and is left as 
Exercise 20.6. 

One way around the preceding hardness result could be that for the purpose of 
learning, it may suffice to find a predictor h € H with low empirical error, not neces- 
sarily an exact ERM. However, it turns out that even the task of finding weights that 
result in close-to-minimal empirical error is computationally infeasible (see (Bartlett 
& Ben-David 2002)). 

One may also wonder whether it may be possible to change the architecture 
of the network so as to circumvent the hardness result. That is, maybe ERM 
with respect to the original network structure is computationally hard but ERM 
with respect to some other, larger, network may be implemented efficiently (see 
Chapter 8 for examples of such cases). Another possibility is to use other activation 
functions (such as sigmoids, or any other type of efficiently computable activation 
functions). There is a strong indication that all such approaches are doomed to
fail. Indeed, under some cryptographic assumption, the problem of learning
intersections of halfspaces is known to be hard even in the representation independent
model of learning (see Klivans & Sherstov (2006)). This implies that, under the
same cryptographic assumption, any hypothesis class which contains intersections
of halfspaces cannot be learned efficiently.

A widely used heuristic for training neural networks relies on the SGD frame- 
work we studied in Chapter 14. There, we have shown that SGD is a successful 
learner if the loss function is convex. In neural networks, the loss function is highly 
nonconvex. Nevertheless, we can still implement the SGD algorithm and hope 
it will find a reasonable solution (as happens to be the case in several practical 
tasks). 


20.6 SGD AND BACKPROPAGATION 


The problem of finding a hypothesis in $\mathcal{H}_{V,E,\sigma}$ with a low risk amounts to the
problem of tuning the weights over the edges. In this section we show how to apply a
heuristic search for good weights using the SGD algorithm. Throughout this section
we assume that $\sigma$ is the sigmoid function, $\sigma(a) = 1/(1+e^{-a})$, but the derivation
holds for any differentiable scalar function.

Since $E$ is a finite set, we can think of the weight function as a vector $w \in \mathbb{R}^{|E|}$.
Suppose the network has $n$ input neurons and $k$ output neurons, and denote by
$h_w : \mathbb{R}^n \to \mathbb{R}^k$ the function calculated by the network if the weight function is defined
by $w$. Let us denote by $\Delta(h_w(x), y)$ the loss of predicting $h_w(x)$ when the target
is $y \in \mathcal{Y}$. For concreteness, we will take $\Delta$ to be the squared loss,
$\Delta(h_w(x), y) = \frac{1}{2}\|h_w(x) - y\|^2$; however, similar derivation can be obtained for every
differentiable function. Finally, given a distribution $\mathcal{D}$ over the examples domain,
$\mathbb{R}^n \times \mathbb{R}^k$, let $L_{\mathcal{D}}(w)$ be the risk of the network, namely,

$$
L_{\mathcal{D}}(w) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\Delta(h_w(x), y)\right].
$$


Recall the SGD algorithm for minimizing the risk function $L_{\mathcal{D}}(w)$. We repeat
the pseudocode from Chapter 14 with a few modifications, which are relevant to the
neural network application because of the nonconvexity of the objective function.
First, while in Chapter 14 we initialized $w$ to be the zero vector, here we initialize $w$
to be a randomly chosen vector with values close to zero. This is because an
initialization with the zero vector will lead all hidden neurons to have the same weights
(if the network is a full layered network). In addition, the hope is that if we repeat
the SGD procedure several times, where each time we initialize the process with
a new random vector, one of the runs will lead to a good local minimum. Second,
while a fixed step size, $\eta$, is guaranteed to be good enough for convex problems,
here we utilize a variable step size, $\eta_t$, as defined in Section 14.4.2. Because of the
nonconvexity of the loss function, the choice of the sequence $\eta_t$ is more significant,
and it is tuned in practice by trial and error. Third, we output the best
performing vector on a validation set. In addition, it is sometimes helpful to add
regularization on the weights, with parameter $\lambda$. That is, we try to minimize
$L_{\mathcal{D}}(w) + \frac{\lambda}{2}\|w\|^2$. Finally, the gradient does not have a closed form solution. Instead, it is
calculated using the backpropagation algorithm, which will be described in the
sequel.


SGD for Neural Networks


parameters:
    number of iterations $\tau$
    step size sequence $\eta_1, \eta_2, \ldots, \eta_\tau$
    regularization parameter $\lambda > 0$
input:
    layered graph $(V, E)$
    differentiable activation function $\sigma : \mathbb{R} \to \mathbb{R}$
initialize:
    choose $w^{(1)} \in \mathbb{R}^{|E|}$ at random
    (from a distribution s.t. $w^{(1)}$ is close enough to $0$)
for $i = 1, 2, \ldots, \tau$
    sample $(x, y) \sim \mathcal{D}$
    calculate gradient $v_i = \mathrm{backpropagation}(x, y, w^{(i)}, (V, E), \sigma)$
    update $w^{(i+1)} = w^{(i)} - \eta_i (v_i + \lambda w^{(i)})$
output:
    $\bar{w}$ is the best performing $w^{(i)}$ on a validation set
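
The loop above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the book's implementation: `grad_fn` is a hypothetical stand-in for the backpropagation call, the initialization range, the step size schedule $0.5/\sqrt{i}$, and the one-weight toy model are arbitrary choices made only to obtain a runnable example, and the validation loss is recomputed after every update for simplicity.

```python
import math
import random

def sgd(grad_fn, sample_fn, val_loss_fn, dim, num_iters, step_size, lam=0.0, seed=0):
    rng = random.Random(seed)
    # Random initialization close to (but not exactly at) zero.
    w = [rng.uniform(-0.01, 0.01) for _ in range(dim)]
    best_w, best_val = list(w), val_loss_fn(w)
    for i in range(1, num_iters + 1):
        x, y = sample_fn(rng)                      # sample (x, y) ~ D
        v = grad_fn(w, x, y)                       # placeholder for backpropagation
        eta = step_size(i)                         # variable step size eta_i
        w = [wj - eta * (vj + lam * wj) for wj, vj in zip(w, v)]
        val = val_loss_fn(w)                       # track the best iterate on validation data
        if val < best_val:
            best_w, best_val = list(w), val
    return best_w

# Toy usage (hypothetical): a one-weight model h_w(x) = w * x with target y = 2x.
def toy_sample(rng):
    x = rng.uniform(-1.0, 1.0)
    return x, 2.0 * x

def toy_grad(w, x, y):
    return [(w[0] * x - y) * x]                    # gradient of 0.5 * (w*x - y)^2

validation = [(x, 2.0 * x) for x in (-1.0, -0.5, 0.5, 1.0)]

def toy_val_loss(w):
    return sum(0.5 * (w[0] * x - y) ** 2 for x, y in validation) / len(validation)

w_best = sgd(toy_grad, toy_sample, toy_val_loss, dim=1, num_iters=500,
             step_size=lambda i: 0.5 / math.sqrt(i))
print(w_best)
```

The toy objective is convex, so it only exercises the mechanics of the loop; for an actual network the same loop applies with the gradient supplied by backpropagation, and repeated runs with different seeds may end in different local minima.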


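Backpropagation


input:
    example $(x, y)$, weight vector $w$, layered graph $(V, E)$,
    activation function $\sigma : \mathbb{R} \to \mathbb{R}$
initialize:
    denote layers of the graph $V_0, \ldots, V_T$ where $V_t = \{v_{t,1}, \ldots, v_{t,k_t}\}$
    define $W_{t,i,j}$ as the weight of $(v_{t,j}, v_{t+1,i})$
    (where we set $W_{t,i,j} = 0$ if $(v_{t,j}, v_{t+1,i}) \notin E$)
forward:
    set $o_0 = x$
    for $t = 1, \ldots, T$
        for $i = 1, \ldots, k_t$
            set $a_{t,i} = \sum_{j=1}^{k_{t-1}} W_{t-1,i,j}\, o_{t-1,j}$
            set $o_{t,i} = \sigma(a_{t,i})$
backward:
    set $\delta_T = o_T - y$
    for $t = T-1, T-2, \ldots, 1$
        for $i = 1, \ldots, k_t$
            $\delta_{t,i} = \sum_{j=1}^{k_{t+1}} W_{t,j,i}\, \delta_{t+1,j}\, \sigma'(a_{t+1,j})$
output:
    foreach edge $(v_{t-1,j}, v_{t,i}) \in E$
        set the partial derivative to $\delta_{t,i}\, \sigma'(a_{t,i})\, o_{t-1,j}$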
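
The pseudocode translates almost line by line into plain Python. The sketch below is an illustration, not the book's code: it assumes a fully connected layered network stored as nested lists `W[t][i][j]` (the weight of the edge from neuron $j$ of layer $t$ to neuron $i$ of layer $t+1$), the sigmoid activation, the squared loss, and no separate bias terms (a constant input coordinate can play that role). The names and the example weights are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)

def forward(W, x):
    # Returns pre-activations a[t] and outputs o[t] for t = 0..T (a[0] is unused).
    a, o = [None], [list(x)]
    for Wt in W:
        at = [sum(wij * oj for wij, oj in zip(row, o[-1])) for row in Wt]
        a.append(at)
        o.append([sigmoid(v) for v in at])
    return a, o

def backpropagation(W, x, y):
    # Gradient of 0.5 * ||h_w(x) - y||^2, returned in the same nested shape as W.
    T = len(W)
    a, o = forward(W, x)
    delta = [None] * (T + 1)
    delta[T] = [oi - yi for oi, yi in zip(o[T], y)]
    for t in range(T - 1, 0, -1):
        delta[t] = [sum(W[t][j][i] * delta[t + 1][j] * sigmoid_prime(a[t + 1][j])
                        for j in range(len(W[t])))
                    for i in range(len(o[t]))]
    # Entry [t-1][i][j] is the partial derivative w.r.t. the edge (v_{t-1,j}, v_{t,i}).
    return [[[delta[t][i] * sigmoid_prime(a[t][i]) * o[t - 1][j]
              for j in range(len(o[t - 1]))]
             for i in range(len(o[t]))]
            for t in range(1, T + 1)]

def loss(W, x, y):
    _, o = forward(W, x)
    return 0.5 * sum((oi - yi) ** 2 for oi, yi in zip(o[-1], y))

# Example: a 2-3-1 network with arbitrary weights.
W = [[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [[0.3, -0.1, 0.2]]]
x, y = [0.5, -1.0], [1.0]
G = backpropagation(W, x, y)
```

A standard sanity check for such code is to compare each entry of `G` with a finite-difference estimate of the corresponding partial derivative of `loss`.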


Explaining How Backpropagation Calculates the Gradient: 
We next explain how the backpropagation algorithm calculates the gradient of the
loss function on an example $(x,y)$ with respect to the vector $w$. Let us first recall
a few definitions from vector calculus. Each element of the gradient is the partial
derivative with respect to the variable in $w$ corresponding to one of the edges of the
network. Recall the definition of a partial derivative. Given a function $f : \mathbb{R}^n \to \mathbb{R}$,
the partial derivative with respect to the $i$th variable at $w$ is obtained by fixing the
values of $w_1,\ldots,w_{i-1}, w_{i+1},\ldots,w_n$, which yields the scalar function $g : \mathbb{R} \to \mathbb{R}$ defined
by $g(a) = f((w_1,\ldots,w_{i-1}, w_i + a, w_{i+1},\ldots,w_n))$, and then taking the derivative of $g$
at $0$. For a function with multiple outputs, $f : \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian of $f$ at $w \in \mathbb{R}^n$,
denoted $J_w(f)$, is the $m \times n$ matrix whose $i,j$ element is the partial derivative of
$f_i : \mathbb{R}^n \to \mathbb{R}$ w.r.t. its $j$th variable at $w$. Note that if $m = 1$ then the Jacobian matrix is the
gradient of the function (represented as a row vector). Two examples of Jacobian
calculations, which we will later use, are as follows.


- Let $f(w) = Aw$ for $A \in \mathbb{R}^{m,n}$. Then $J_w(f) = A$.

- For every $n$, we use the notation $\sigma$ to denote the function from $\mathbb{R}^n$ to $\mathbb{R}^n$ which
  applies the sigmoid function element-wise. That is, $\alpha = \sigma(\theta)$ means that for
  every $i$ we have $\alpha_i = \sigma(\theta_i) = \frac{1}{1+\exp(-\theta_i)}$. It is easy to verify that $J_\theta(\sigma)$ is a
  diagonal matrix whose $(i,i)$ entry is $\sigma'(\theta_i)$, where $\sigma'$ is the derivative function of the
  (scalar) sigmoid function, namely, $\sigma'(\theta_i) = \frac{1}{(1+\exp(\theta_i))(1+\exp(-\theta_i))}$. We also use the
  notation $\mathrm{diag}(\sigma'(\theta))$ to denote this matrix.


The chain rule for taking the derivative of a composition of functions can be
written in terms of the Jacobian as follows. Given two functions $f : \mathbb{R}^n \to \mathbb{R}^m$ and
$g : \mathbb{R}^k \to \mathbb{R}^n$, we have that the Jacobian of the composition function, $(f \circ g) : \mathbb{R}^k \to \mathbb{R}^m$,
at $w$, is

$$
J_w(f \circ g) = J_{g(w)}(f)\, J_w(g).
$$

For example, for $g(w) = Aw$, where $A \in \mathbb{R}^{n,k}$, we have that

$$
J_w(\sigma \circ g) = \mathrm{diag}(\sigma'(Aw))\, A.
$$
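
The last identity is easy to confirm numerically. The following sketch (hypothetical helper names, arbitrary example values for $A$ and $w$) builds $\mathrm{diag}(\sigma'(Aw))\,A$ directly and compares it with a central finite-difference approximation of the Jacobian of $w \mapsto \sigma(Aw)$.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime(a):
    # The closed form quoted in the text: 1 / ((1 + e^a)(1 + e^-a)).
    return 1.0 / ((1.0 + math.exp(a)) * (1.0 + math.exp(-a)))

def matvec(A, w):
    return [sum(aij * wj for aij, wj in zip(row, w)) for row in A]

def analytic_jacobian(A, w):
    # diag(sigma'(Aw)) A: row i of A scaled by sigma'((Aw)_i).
    return [[sigmoid_prime(z) * aij for aij in row] for row, z in zip(A, matvec(A, w))]

def numeric_jacobian(func, w, eps=1e-6):
    # Central differences; column j approximates the derivative w.r.t. w_j.
    columns = []
    for j in range(len(w)):
        up, down = list(w), list(w)
        up[j] += eps
        down[j] -= eps
        columns.append([(u - d) / (2 * eps) for u, d in zip(func(up), func(down))])
    return [list(row) for row in zip(*columns)]   # transpose so that rows index outputs

A = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]        # arbitrary example values, n = 2, k = 3
w = [0.3, -0.2, 0.1]
composed = lambda v: [sigmoid(z) for z in matvec(A, v)]
J_exact = analytic_jacobian(A, w)
J_approx = numeric_jacobian(composed, w)
max_gap = max(abs(e - a) for re, ra in zip(J_exact, J_approx) for e, a in zip(re, ra))
print(max_gap)
```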


To describe the backpropagation algorithm, let us first decompose $V$ into the
layers of the graph, $V = \dot{\cup}_{t=0}^{T} V_t$. For every $t$, let us write $V_t = \{v_{t,1},\ldots,v_{t,k_t}\}$, where
$k_t = |V_t|$. In addition, for every $t$ denote $W_t \in \mathbb{R}^{k_{t+1},k_t}$ a matrix which gives a weight to
every potential edge between $V_t$ and $V_{t+1}$. If the edge exists in $E$ then we set $W_{t,i,j}$ to
be the weight, according to $w$, of the edge $(v_{t,j}, v_{t+1,i})$. Otherwise, we add a "phantom"
edge and set its weight to be zero, $W_{t,i,j} = 0$. Since when calculating the partial
derivative with respect to the weight of some edge we fix all other weights, these
additional "phantom" edges have no effect on the partial derivative with respect
to existing edges. It follows that we can assume, without loss of generality, that all
edges exist, that is, $E = \cup_t (V_t \times V_{t+1})$.

Next, we discuss how to calculate the partial derivatives with respect to the edges
from $V_{t-1}$ to $V_t$, namely, with respect to the elements in $W_{t-1}$. Since we fix all other
weights of the network, it follows that the outputs of all the neurons in $V_{t-1}$ are fixed
numbers which do not depend on the weights in $W_{t-1}$. Denote the corresponding
vector by $o_{t-1}$. In addition, let us denote by $\ell_t : \mathbb{R}^{k_t} \to \mathbb{R}$ the loss function of the
subnetwork defined by layers $V_t,\ldots,V_T$ as a function of the outputs of the neurons
in $V_t$. The input to the neurons of $V_t$ can be written as $a_t = W_{t-1} o_{t-1}$ and the output
of the neurons of $V_t$ is $o_t = \sigma(a_t)$. That is, for every $j$ we have $o_{t,j} = \sigma(a_{t,j})$. We
obtain that the loss, as a function of $W_{t-1}$, can be written as

$$
g_t(W_{t-1}) = \ell_t(o_t) = \ell_t(\sigma(a_t)) = \ell_t(\sigma(W_{t-1} o_{t-1})).
$$


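It would be convenient to rewrite this as follows. Let $w_{t-1} \in \mathbb{R}^{k_{t-1} k_t}$ be the column
vector obtained by concatenating the rows of $W_{t-1}$ and then taking the transpose of
the resulting long vector. Define by $O_{t-1}$ the $k_t \times (k_{t-1} k_t)$ matrix

$$
O_{t-1} = \begin{pmatrix}
o_{t-1}^\top & 0 & \cdots & 0 \\
0 & o_{t-1}^\top & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & o_{t-1}^\top
\end{pmatrix}.
\tag{20.2}
$$

Then, $W_{t-1} o_{t-1} = O_{t-1} w_{t-1}$, so we can also write

$$
g_t(w_{t-1}) = \ell_t(\sigma(O_{t-1} w_{t-1})).
$$

Therefore, applying the chain rule, we obtain that

$$
J_{w_{t-1}}(g_t) = J_{\sigma(O_{t-1} w_{t-1})}(\ell_t)\, \mathrm{diag}(\sigma'(O_{t-1} w_{t-1}))\, O_{t-1}.
$$

Using our notation we have $o_t = \sigma(O_{t-1} w_{t-1})$ and $a_t = O_{t-1} w_{t-1}$, which yields

$$
J_{w_{t-1}}(g_t) = J_{o_t}(\ell_t)\, \mathrm{diag}(\sigma'(a_t))\, O_{t-1}.
$$

Let us also denote $\delta_t = J_{o_t}(\ell_t)$. Then, we can further rewrite the preceding as

$$
J_{w_{t-1}}(g_t) = \left(\delta_{t,1}\, \sigma'(a_{t,1})\, o_{t-1}^\top,\ \ldots,\ \delta_{t,k_t}\, \sigma'(a_{t,k_t})\, o_{t-1}^\top\right).
\tag{20.3}
$$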


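It is left to calculate the vector $\delta_t = J_{o_t}(\ell_t)$ for every $t$. This is the gradient of $\ell_t$
at $o_t$. We calculate this in a recursive manner. First observe that for the last layer
we have that $\ell_T(u) = \Delta(u,y)$, where $\Delta$ is the loss function. Since we assume that
$\Delta(u,y) = \frac{1}{2}\|u - y\|^2$ we obtain that $J_u(\ell_T) = (u - y)$. In particular,
$\delta_T = J_{o_T}(\ell_T) = (o_T - y)$. Next, note that

$$
\ell_t(u) = \ell_{t+1}(\sigma(W_t u)).
$$

Therefore, by the chain rule,

$$
J_u(\ell_t) = J_{\sigma(W_t u)}(\ell_{t+1})\, \mathrm{diag}(\sigma'(W_t u))\, W_t.
$$

In particular,

$$
\begin{aligned}
\delta_t = J_{o_t}(\ell_t) &= J_{\sigma(W_t o_t)}(\ell_{t+1})\, \mathrm{diag}(\sigma'(W_t o_t))\, W_t \\
&= J_{o_{t+1}}(\ell_{t+1})\, \mathrm{diag}(\sigma'(a_{t+1}))\, W_t \\
&= \delta_{t+1}\, \mathrm{diag}(\sigma'(a_{t+1}))\, W_t.
\end{aligned}
$$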


In summary, we can first calculate the vectors $\{a_t, o_t\}$ from the bottom of the
network to its top. Then, we calculate the vectors $\{\delta_t\}$ from the top of the network
back to its bottom. Once we have all of these vectors, the partial derivatives are
easily obtained using Equation (20.3). We have thus shown that the pseudocode of
backpropagation indeed calculates the gradient.


20.7 SUMMARY


Neural networks over graphs of size $s(n)$ can be used to describe hypothesis classes
of all predictors that can be implemented in runtime of $O(\sqrt{s(n)})$. We have also
shown that their sample complexity depends polynomially on s(n) (specifically, 
it depends on the number of edges in the network). Therefore, classes of neu- 
ral network hypotheses seem to be an excellent choice. Regrettably, the problem 
of training the network on the basis of training data is computationally hard. We 
have presented the SGD framework as a heuristic approach for training neural net- 
works and described the backpropagation algorithm which efficiently calculates the 
gradient of the loss function with respect to the weights over the edges. 


20.8 BIBLIOGRAPHIC REMARKS 


Neural networks were extensively studied in the 1980s and early 1990s, but with 
mixed empirical success. In recent years, a combination of algorithmic advance- 
ments, as well as increasing computational power and data size, has led to a 
breakthrough in the effectiveness of neural networks. In particular, “deep net- 
works” (i.e., networks of more than 2 layers) have shown very impressive practical 
performance on a variety of domains. A few examples include convolutional net- 
works (LeCun & Bengio 1995), restricted Boltzmann machines (Hinton, Osindero 
& Teh 2006), auto-encoders (Ranzato et al. 2007, Bengio & LeCun 2007, Collobert 
& Weston 2008, Lee et al. 2009, Le et al. 2012), and sum-product networks (Livni, 
Shalev-Shwartz & Shamir 2013, Poon & Domingos 2011). See also (Bengio 2009) 
and the references therein. 

The expressive power of neural networks and the relation to circuit complexity 
have been extensively studied in (Parberry 1994). For the analysis of the sample 
complexity of neural networks we refer the reader to (Anthony & Bartlett 1999).
Our proof technique of Theorem 20.6 is due to the lecture notes of Kakade and Tewari.

Klivans and Sherstov (2006) have shown that for any $c > 0$, intersections of
$n^c$ halfspaces over $\{\pm 1\}^n$ are not efficiently PAC learnable, even if we allow
representation independent learning. This hardness result relies on the cryptographic
assumption that there is no polynomial time solution to the unique-shortest-vector 
problem. As we have argued, this implies that there cannot be an efficient algorithm 
for training neural networks, even if we allow larger networks or other activation 
functions that can be implemented efficiently. 

The backpropagation algorithm has been introduced in Rumelhart, Hinton, and 
Williams (1986). 


20.9 EXERCISES 


20.1 Neural Networks are universal approximators: Let $f : [-1,1]^n \to [-1,1]$ be a
$\rho$-Lipschitz function. Fix some $\epsilon > 0$. Construct a neural network $N : [-1,1]^n \to [-1,1]$,
with the sigmoid activation function, such that for every $x \in [-1,1]^n$ it
holds that $|f(x) - N(x)| \le \epsilon$.
Hint: Similarly to the proof of Theorem 19.3, partition $[-1,1]^n$ into small boxes.
Use the Lipschitzness of $f$ to show that it is approximately constant at each box.
Finally, show that a neural network can first decide which box the input vector
belongs to, and then predict the averaged value of $f$ at that box.

20.2 Prove Theorem 20.5.
Hint: For every $f : \{-1,1\}^n \to \{-1,1\}$ construct a 1-Lipschitz function
$g : [-1,1]^n \to [-1,1]$ such that if you can approximate $g$ then you can express $f$.

20.3 Growth function of product: For $i = 1,2$, let $\mathcal{F}_i$ be a set of functions from $\mathcal{X}$ to $\mathcal{Y}_i$.
Define $\mathcal{H} = \mathcal{F}_1 \times \mathcal{F}_2$ to be the Cartesian product class. That is, for every $f_1 \in \mathcal{F}_1$
and $f_2 \in \mathcal{F}_2$, there exists $h \in \mathcal{H}$ such that $h(x) = (f_1(x), f_2(x))$. Prove that
$\tau_{\mathcal{H}}(m) \le \tau_{\mathcal{F}_1}(m)\, \tau_{\mathcal{F}_2}(m)$.

20.4 Growth function of composition: Let $\mathcal{F}_1$ be a set of functions from $\mathcal{X}$ to $\mathcal{Z}$ and let
$\mathcal{F}_2$ be a set of functions from $\mathcal{Z}$ to $\mathcal{Y}$. Let $\mathcal{H} = \mathcal{F}_2 \circ \mathcal{F}_1$ be the composition class. That
is, for every $f_1 \in \mathcal{F}_1$ and $f_2 \in \mathcal{F}_2$, there exists $h \in \mathcal{H}$ such that $h(x) = f_2(f_1(x))$. Prove
that $\tau_{\mathcal{H}}(m) \le \tau_{\mathcal{F}_2}(m)\, \tau_{\mathcal{F}_1}(m)$.

20.5 VC of sigmoidal networks: In this exercise we show that there is a graph $(V, E)$
such that the VC dimension of the class of neural networks over these graphs with
the sigmoid activation function is $\Omega(|E|^2)$. Note that for every $\epsilon > 0$, the sigmoid
activation function can approximate the threshold activation function, $\mathbb{1}_{[\sum_i x_i > 0]}$, up
to accuracy $\epsilon$. To simplify the presentation, throughout the exercise we assume
that we can exactly implement the activation function $\mathbb{1}_{[\sum_i x_i > 0]}$ using a sigmoid
activation function.

Fix some $n$.

1. Construct a network, $N_1$, with $O(n)$ weights, which implements a function from
   $\mathbb{R}$ to $\{0,1\}^n$ and satisfies the following property. For every $x \in \{0,1\}^n$, if we feed
   the network with the real number $0.x_1x_2\ldots x_n$, then the output of the network
   will be $x$.
   Hint: Denote $\alpha = 0.x_1x_2\ldots x_n$ and observe that $10^k\alpha - 0.5$ is at least $0.5$ if $x_k = 1$
   and is at most $-0.3$ if $x_k = -1$.

2. Construct a network, $N_2$, with $O(n)$ weights, which implements a function from
   $[n]$ to $\{0,1\}^n$ such that $N_2(i) = e_i$ for all $i$. That is, upon receiving the input $i$, the
   network outputs the vector of all zeros except 1 at the $i$'th neuron.

3. Let $\alpha_1,\ldots,\alpha_n$ be $n$ real numbers such that every $\alpha_i$ is of the form
   $0.a_1^{(i)}a_2^{(i)}\ldots a_n^{(i)}$, with $a_j^{(i)} \in \{0,1\}$. Construct a network, $N_3$, with $O(n)$ weights,
   which implements a function from $[n]$ to $\mathbb{R}$, and satisfies $N_3(i) = \alpha_i$ for every $i \in [n]$.

4. Combine $N_1$, $N_3$ to obtain a network that receives $i \in [n]$ and outputs $a^{(i)}$.

5. Construct a network $N_4$ that receives $(i,j) \in [n] \times [n]$ and outputs $a_j^{(i)}$.
   Hint: Observe that the AND function over $\{0,1\}^2$ can be calculated using $O(1)$
   weights.

6. Conclude that there is a graph with $O(n)$ weights such that the VC dimension
   of the resulting hypothesis class is $n^2$.

20.6 Prove Theorem 20.7.
Hint: The proof is similar to the hardness of learning intersections of halfspaces;
see Exercise 32 in Chapter 8.


In this chapter we describe a different model of learning, which is called online
learning. Previously, we studied the PAC learning model, in which the learner first 
receives a batch of training examples, uses the training set to learn a hypothesis, 
and only when learning is completed uses the learned hypothesis for predicting 
the label of new examples. In our papayas learning problem, this means that we 
should first buy a bunch of papayas and taste them all. Then, we use all of this 
information to learn a prediction rule that determines the taste of new papayas. In 
contrast, in online learning there is no separation between a training phase and a 
prediction phase. Instead, each time we buy a papaya, it is first considered a test 
example since we should predict whether it is going to taste good. Then, after taking 
a bite from the papaya, we know the true label, and the same papaya can be used 
as a training example that can help us improve our prediction mechanism for future 
papayas. 

Concretely, online learning takes place in a sequence of consecutive rounds. 
On each online round, the learner first receives an instance (the learner buys a 
papaya and knows its shape and color, which form the instance). Then, the learner 
is required to predict a label (is the papaya tasty?). At the end of the round, the 
learner obtains the correct label (he tastes the papaya and then knows whether 
it is tasty or not). Finally, the learner uses this information to improve his future 
predictions. 

To analyze online learning, we follow a similar route to our study of PAC 
learning. We start with online binary classification problems. We consider both the 
realizable case, in which we assume, as prior knowledge, that all the labels are gen- 
erated by some hypothesis from a given hypothesis class, and the unrealizable case, 
which corresponds to the agnostic PAC learning model. In particular, we present 
an important algorithm called Weighted-Majority. Next, we study online learning 
problems in which the loss function is convex. Finally, we present the Perceptron 
algorithm as an example of the use of surrogate convex loss functions in the online 
learning model.


21.1 ONLINE CLASSIFICATION IN THE REALIZABLE CASE


Online learning is performed in a sequence of consecutive rounds, where at round
$t$ the learner is given an instance, $x_t$, taken from an instance domain $\mathcal{X}$, and is
required to provide its label. We denote the predicted label by $p_t$. After predicting
the label, the correct label, $y_t \in \{0,1\}$, is revealed to the learner. The learner's goal
is to make as few prediction mistakes as possible during this process. The learner 
tries to deduce information from previous rounds so as to improve its predictions 
on future rounds. 

Clearly, learning is hopeless if there is no correlation between past and present 
rounds. Previously in the book, we studied the PAC model in which we assume that 
past and present examples are sampled i.i.d. from the same distribution source. In
the online learning model we make no statistical assumptions regarding the origin 
of the sequence of examples. The sequence is allowed to be deterministic, stochas- 
tic, or even adversarially adaptive to the learner’s own behavior (as in the case of 
spam e-mail filtering). Naturally, an adversary can make the number of predic- 
tion mistakes of our online learning algorithm arbitrarily large. For example, the 
adversary can present the same instance on each online round, wait for the learner’s 
prediction, and provide the opposite label as the correct label. 

To make nontrivial statements we must further restrict the problem. The
realizability assumption is one possible natural restriction. In the realizable case, we
assume that all the labels are generated by some hypothesis, $h^\star : \mathcal{X} \to \mathcal{Y}$. Furthermore,
$h^\star$ is taken from a hypothesis class $\mathcal{H}$, which is known to the learner. This is
analogous to the PAC learning model we studied in Chapter 3. With this restriction
on the sequence, the learner should make as few mistakes as possible, assuming
that both $h^\star$ and the sequence of instances can be chosen by an adversary. For an
online learning algorithm, $A$, we denote by $M_A(\mathcal{H})$ the maximal number of mistakes
$A$ might make on a sequence of examples which is labeled by some $h^\star \in \mathcal{H}$. We
emphasize again that both $h^\star$ and the sequence of instances can be chosen by an
adversary. A bound on $M_A(\mathcal{H})$ is called a mistake-bound and we will study how to
design algorithms for which $M_A(\mathcal{H})$ is minimal. Formally:


Definition 21.1 (Mistake Bounds, Online Learnability). Let $\mathcal{H}$ be a hypothesis
class and let $A$ be an online learning algorithm. Given any sequence
$S = (x_1, h^\star(x_1)), \ldots, (x_T, h^\star(x_T))$, where $T$ is any integer and $h^\star \in \mathcal{H}$, let $M_A(S)$ be
the number of mistakes $A$ makes on the sequence $S$. We denote by $M_A(\mathcal{H})$ the
supremum of $M_A(S)$ over all sequences of the preceding form. A bound of the form
$M_A(\mathcal{H}) \le B < \infty$ is called a mistake bound. We say that a hypothesis class $\mathcal{H}$ is online
learnable if there exists an algorithm $A$ for which $M_A(\mathcal{H}) \le B < \infty$.


Our goal is to study which hypothesis classes are learnable in the online model, 
and in particular to find good learning algorithms for a given hypothesis class. 


Remark 21.1. Throughout this section and the next, we ignore the computational 
aspect of learning, and do not restrict the algorithms to be efficient. In Section 21.3 
and Section 21.4 we study efficient online learning algorithms. 


To simplify the presentation, we start with the case of a finite hypothesis class,
namely, $|\mathcal{H}| < \infty$.

In PAC learning, we identified ERM as a good learning algorithm, in the sense
that if $\mathcal{H}$ is learnable then it is learnable by the rule $\mathrm{ERM}_{\mathcal{H}}$. A natural learning rule
for online learning is to use (at any online round) any ERM hypothesis, namely, any
hypothesis which is consistent with all past examples.


Consistent


input: A finite hypothesis class $\mathcal{H}$
initialize: $V_1 = \mathcal{H}$
for $t = 1, 2, \ldots$
    receive $x_t$
    choose any $h \in V_t$
    predict $p_t = h(x_t)$
    receive true label $y_t = h^\star(x_t)$
    update $V_{t+1} = \{h \in V_t : h(x_t) = y_t\}$

The Consistent algorithm maintains a set, $V_t$, of all the hypotheses which are
consistent with $(x_1, y_1), \ldots, (x_{t-1}, y_{t-1})$. This set is often called the version space. It
then picks any hypothesis from $V_t$ and predicts according to this hypothesis.

Obviously, whenever Consistent makes a prediction mistake, at least one hypothesis
is removed from $V_t$. Therefore, after making $M$ mistakes we have $|V_t| \le |\mathcal{H}| - M$.
Since $V_t$ is always nonempty (by the realizability assumption it contains $h^\star$) we have
$1 \le |V_t| \le |\mathcal{H}| - M$. Rearranging, we obtain the following:


Corollary 21.2. Let $\mathcal{H}$ be a finite hypothesis class. The Consistent algorithm enjoys the
mistake bound $M_{\mathrm{Consistent}}(\mathcal{H}) \le |\mathcal{H}| - 1$.
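
A short simulation makes the bound concrete. The sketch below is illustrative: hypotheses are represented as Python callables, "choose any $h \in V_t$" is implemented by taking the first element of the version space, and the class of singleton indicators together with the instance order is a hypothetical example chosen so that this particular implementation makes $|\mathcal{H}| - 1$ mistakes.

```python
def run_consistent(hypotheses, target, instances):
    # Runs Consistent on a realizable sequence and returns the number of mistakes.
    version_space = list(hypotheses)              # V_1 = H
    mistakes = 0
    for x in instances:
        h = version_space[0]                      # "choose any h in V_t": here, the first one
        p = h(x)
        y = target(x)                             # true label y_t = h*(x_t)
        if p != y:
            mistakes += 1
        version_space = [g for g in version_space if g(x) == y]
    return mistakes

# Hypothetical class: singletons on {0, ..., 7}, h_i(x) = 1 iff x == i.
d = 8
H = [lambda x, i=i: int(x == i) for i in range(d)]
h_star = H[-1]
mistakes = run_consistent(H, h_star, range(d))
print(mistakes, len(H) - 1)
```

On each round before the last, the first surviving hypothesis predicts 1 on its own point, the true label is 0, and exactly one hypothesis is eliminated, so the version space shrinks as slowly as the bound allows.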


It is rather easy to construct a hypothesis class and a sequence of examples on
which Consistent will indeed make $|\mathcal{H}| - 1$ mistakes (see Exercise 21.1). Therefore,
we present a better algorithm in which we choose $h \in V_t$ in a smarter way. We shall
see that this algorithm is guaranteed to make exponentially fewer mistakes.


Halving 


input: A finite hypothesis class $\mathcal{H}$ 
initialize: $V_1 = \mathcal{H}$ 
for $t = 1, 2, \ldots$ 
    receive $x_t$ 
    predict $p_t = \operatorname{argmax}_{r \in \{0,1\}} |\{h \in V_t : h(x_t) = r\}|$ 
    (in case of a tie predict $p_t = 1$) 
    receive true label $y_t = h^*(x_t)$ 
    update $V_{t+1} = \{h \in V_t : h(x_t) = y_t\}$ 


Theorem 21.3. Let $\mathcal{H}$ be a finite hypothesis class. The Halving algorithm enjoys the 
mistake bound $M_{\mathrm{Halving}}(\mathcal{H}) \le \log_2(|\mathcal{H}|)$. 

Proof. We simply note that whenever the algorithm errs we have $|V_{t+1}| \le |V_t|/2$ 
(hence the name Halving). Therefore, if $M$ is the total number of mistakes, we have 

$$1 \le |V_{T+1}| \le |\mathcal{H}| \, 2^{-M}.$$ 

Rearranging this inequality we conclude our proof. □ 


Of course, Halving’s mistake bound is much better than Consistent’s mistake 
bound. We already see that online learning is different from PAC learning—while 
in PAC, any ERM hypothesis is good, in online learning choosing an arbitrary ERM 
hypothesis is far from being optimal. 
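
The gap between the two rules can be seen on a small example. The following is a minimal sketch, not a reference implementation: hypotheses are represented as Python callables, "choose any $h \in V_t$" is instantiated (as an assumption) by always taking the first surviving hypothesis, and the class of singleton indicators and the example sequence are chosen purely for illustration.

```python
from math import log2
from typing import Callable, List, Sequence, Tuple

Hypothesis = Callable[[int], int]
Example = Tuple[int, int]


def run_consistent(hypotheses: Sequence[Hypothesis], examples: Sequence[Example]) -> int:
    """Count mistakes of Consistent when it always picks the first surviving hypothesis."""
    version_space: List[Hypothesis] = list(hypotheses)
    mistakes = 0
    for x, y in examples:
        h = version_space[0]  # an arbitrary (here: worst-case) choice of h in V_t
        if h(x) != y:
            mistakes += 1
        # keep only the hypotheses consistent with the revealed label
        version_space = [g for g in version_space if g(x) == y]
    return mistakes


def run_halving(hypotheses: Sequence[Hypothesis], examples: Sequence[Example]) -> int:
    """Count mistakes of Halving (majority vote over the version space, ties -> 1)."""
    version_space: List[Hypothesis] = list(hypotheses)
    mistakes = 0
    for x, y in examples:
        votes_for_one = sum(g(x) for g in version_space)
        prediction = 1 if 2 * votes_for_one >= len(version_space) else 0
        if prediction != y:
            mistakes += 1
        version_space = [g for g in version_space if g(x) == y]
    return mistakes


# Illustrative class: h_j(x) = 1 iff x == j, for j = 1, ..., d.
d = 8
singletons = [lambda x, j=j: int(x == j) for j in range(1, d + 1)]
# Realizable sequence: the target is h_d, shown on the instances 1, ..., d-1.
examples = [(x, 0) for x in range(1, d)]

consistent_mistakes = run_consistent(singletons, examples)
halving_mistakes = run_halving(singletons, examples)
```

On this sequence the first-surviving choice errs on every round, while the majority vote errs only when the version space has shrunk to two hypotheses and the tie-breaking rule applies.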


21.1.1 Online Learnability 


We next take a more general approach, and aim at characterizing online learnability. 
In particular, we target the following question: What is the optimal online learning 
algorithm for a given hypothesis class $\mathcal{H}$? 

We present a dimension of hypothesis classes that characterizes the best achievable 
mistake bound. This measure was proposed by Nick Littlestone and we 
therefore refer to it as $\mathrm{Ldim}(\mathcal{H})$. 

To motivate the definition of Ldim it is convenient to view the online learning 
process as a game between two players: the learner versus the environment. On 
round $t$ of the game, the environment picks an instance $x_t$, the learner predicts a 
label $p_t \in \{0,1\}$, and finally the environment outputs the true label, $y_t \in \{0,1\}$. Suppose 
that the environment wants to make the learner err on the first $T$ rounds of the 
game. Then, it must output $y_t = 1 - p_t$, and the only question is how it should choose 
the instances $x_t$ in such a way that ensures that for some $h^* \in \mathcal{H}$ we have $y_t = h^*(x_t)$ 
for all $t \in [T]$. 

A strategy for an adversarial environment can be formally described as a binary 
tree, as follows. Each node of the tree is associated with an instance from $\mathcal{X}$. Initially, 
the environment presents to the learner the instance associated with the root of the 
tree. Then, if the learner predicts $p_t = 1$ the environment will declare that this is a 
wrong prediction (i.e., $y_t = 0$) and will traverse to the left child of the current node. 
If the learner predicts $p_t = 0$ then the environment will set $y_t = 1$ and will traverse 
to the right child. This process will continue and at each round, the environment will 
present the instance associated with the current node. 

Formally, consider a complete binary tree of depth $T$ (we define the depth of 
the tree as the number of edges in a path from the root to a leaf). We have $2^{T+1} - 1$ 
nodes in such a tree, and we attach an instance to each node. Let $v_1, \ldots, v_{2^{T+1}-1}$ be 
these instances. We start from the root of the tree, and set $x_1 = v_1$. At round $t$, we 
set $x_t = v_{i_t}$, where $i_t$ is the current node. At the end of round $t$, we go to the left child 
of $i_t$ if $y_t = 0$ or to the right child if $y_t = 1$. That is, $i_{t+1} = 2 i_t + y_t$. Unraveling the 
recursion we obtain $i_t = 2^{t-1} + \sum_{j=1}^{t-1} y_j \, 2^{t-1-j}$. 
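
The node-indexing scheme is easy to check numerically. The sketch below (function names are illustrative, not from the text) follows the recursion $i_{t+1} = 2 i_t + y_t$ from the root index 1 and compares it with the unraveled closed form.

```python
from itertools import product
from typing import Sequence


def node_index(labels: Sequence[int]) -> int:
    """Index of the node reached after seeing the labels y_1, ..., y_{t-1}."""
    i = 1  # the root
    for y in labels:
        i = 2 * i + y  # left child if y == 0, right child if y == 1
    return i


def node_index_closed_form(labels: Sequence[int]) -> int:
    """Closed form i_t = 2^(t-1) + sum_{j=1}^{t-1} y_j * 2^(t-1-j)."""
    t = len(labels) + 1
    return 2 ** (t - 1) + sum(y * 2 ** (t - 1 - j) for j, y in enumerate(labels, start=1))


# The two expressions agree on every label sequence of length up to 5.
all_agree = all(
    node_index(ys) == node_index_closed_form(ys)
    for length in range(6)
    for ys in product((0, 1), repeat=length)
)
```

With this convention the nodes of each level are numbered consecutively from left to right, so a tree of depth $T$ uses exactly the indices $1, \ldots, 2^{T+1} - 1$.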

The preceding strategy for the environment succeeds only if for every 
$(y_1, \ldots, y_T)$ there exists $h \in \mathcal{H}$ such that $y_t = h(x_t)$ for all $t \in [T]$. This leads to the 
following definition. 


Definition 21.4 ($\mathcal{H}$ Shattered Tree). A shattered tree of depth $d$ is a sequence 
of instances $v_1, \ldots, v_{2^d - 1}$ in $\mathcal{X}$ such that for every labeling $(y_1, \ldots, y_d) \in \{0,1\}^d$ 
there exists $h \in \mathcal{H}$ such that for all $t \in [d]$ we have $h(v_{i_t}) = y_t$, where 
$i_t = 2^{t-1} + \sum_{j=1}^{t-1} y_j \, 2^{t-1-j}$. 

An illustration of a shattered tree of depth 2 is given in Figure 21.1. 

[Figure: a complete binary tree with root $v_1$, left child $v_2$, and right child $v_3$.] 

         h_1   h_2   h_3   h_4 
  v_1     0     0     1     1 
  v_2     0     1     *     * 
  v_3     *     *     0     1 

Figure 21.1. An illustration of a shattered tree of depth 2. The dashed path corresponds 
to the sequence of examples $((v_1, 1), (v_3, 0))$. The tree is shattered by $\mathcal{H} = \{h_1, h_2, h_3, h_4\}$, 
where the predictions of each hypothesis in $\mathcal{H}$ on the instances $v_1, v_2, v_3$ are given in the 
table (the * mark means that $h_j(v_i)$ can be either 1 or 0). 


Definition 21.5 (Littlestone’s Dimension (Ldim)). Ldim(H) is the maximal integer 
T such that there exists a shattered tree of depth T, which is shattered by H. 


The definition of Ldim and the previous discussion immediately imply the 
following: 


Lemma 21.6. No algorithm can have a mistake bound strictly smaller than 
$\mathrm{Ldim}(\mathcal{H})$; namely, for every algorithm, $A$, we have $M_A(\mathcal{H}) \ge \mathrm{Ldim}(\mathcal{H})$. 


Proof. Let $T = \mathrm{Ldim}(\mathcal{H})$ and let $v_1, \ldots, v_{2^T - 1}$ be a sequence that satisfies the 
requirements in the definition of Ldim. If the environment sets $x_t = v_{i_t}$ and $y_t = 
1 - p_t$ for all $t \in [T]$, then the learner makes $T$ mistakes while the definition of Ldim 
implies that there exists a hypothesis $h \in \mathcal{H}$ such that $y_t = h(x_t)$ for all $t$. □ 


Let us now give several examples. 


Example 21.2. Let $\mathcal{H}$ be a finite hypothesis class. Clearly, any tree that is shattered 
by $\mathcal{H}$ has depth of at most $\log_2(|\mathcal{H}|)$. Therefore, $\mathrm{Ldim}(\mathcal{H}) \le \log_2(|\mathcal{H}|)$. Another way 
to conclude this inequality is by combining Lemma 21.6 with Theorem 21.3. 


Example 21.3. Let $\mathcal{X} = \{1, \ldots, d\}$ and $\mathcal{H} = \{h_1, \ldots, h_d\}$ where $h_j(x) = 1$ iff $x = j$. 
Then, it is easy to show that $\mathrm{Ldim}(\mathcal{H}) = 1$ while $|\mathcal{H}| = d$ can be arbitrarily large. 
Therefore, this example shows that $\mathrm{Ldim}(\mathcal{H})$ can be significantly smaller than 
$\log_2(|\mathcal{H}|)$. 


Example 21.4. Let $\mathcal{X} = [0,1]$ and $\mathcal{H} = \{x \mapsto \mathbb{1}_{[x < a]} : a \in [0,1]\}$; namely, $\mathcal{H}$ is the 
class of thresholds on the interval $[0,1]$. Then, $\mathrm{Ldim}(\mathcal{H}) = \infty$. To see this, consider 
the tree 

[Illustration: a complete binary tree whose root is 1/2, whose second level is 1/4, 3/4, 
and whose third level is 1/8, 3/8, 5/8, 7/8.] 

This tree is shattered by $\mathcal{H}$. And, because of the density of the reals, this tree can be 
made arbitrarily deep. 


Lemma 21.6 states that $\mathrm{Ldim}(\mathcal{H})$ lower bounds the mistake bound of any algorithm. 
Interestingly, there is a standard algorithm whose mistake bound matches this 
lower bound. The algorithm is similar to the Halving algorithm. Recall that the prediction 
of Halving is made according to a majority vote of the hypotheses which are 
consistent with previous examples. We denoted this set by $V_t$. Put another way, Halving 
partitions $V_t$ into two sets: $V_t^+ = \{h \in V_t : h(x_t) = 1\}$ and $V_t^- = \{h \in V_t : h(x_t) = 0\}$. 
It then predicts according to the larger of the two groups. The rationale behind this 
prediction is that whenever Halving makes a mistake it ends up with $|V_{t+1}| \le 0.5\,|V_t|$. 

The optimal algorithm we present in the following uses the same idea, but 
instead of predicting according to the larger class, it predicts according to the class 
with larger Ldim. 


Standard Optimal Algorithm (SOA) 


input: A hypothesis class $\mathcal{H}$ 
initialize: $V_1 = \mathcal{H}$ 
for $t = 1, 2, \ldots$ 
    receive $x_t$ 
    for $r \in \{0,1\}$ let $V_t^{(r)} = \{h \in V_t : h(x_t) = r\}$ 
    predict $p_t = \operatorname{argmax}_{r \in \{0,1\}} \mathrm{Ldim}(V_t^{(r)})$ 
    (in case of a tie predict $p_t = 1$) 
    receive true label $y_t$ 
    update $V_{t+1} = \{h \in V_t : h(x_t) = y_t\}$ 


The following lemma formally establishes the optimality of the preceding 
algorithm. 


Lemma 21.7. SOA enjoys the mistake bound $M_{\mathrm{SOA}}(\mathcal{H}) \le \mathrm{Ldim}(\mathcal{H})$. 


Proof. It suffices to prove that whenever the algorithm makes a prediction mistake 
we have $\mathrm{Ldim}(V_{t+1}) \le \mathrm{Ldim}(V_t) - 1$. We prove this claim by assuming the contrary, 
that is, $\mathrm{Ldim}(V_{t+1}) = \mathrm{Ldim}(V_t)$. If this holds true, then the definition of $p_t$ implies 
that $\mathrm{Ldim}(V_t^{(r)}) = \mathrm{Ldim}(V_t)$ for both $r = 1$ and $r = 0$. But then we can construct 
a shattered tree of depth $\mathrm{Ldim}(V_t) + 1$ for the class $V_t$, which leads to the desired 
contradiction. □ 


Combining Lemma 21.7 and Lemma 21.6 we obtain: 


Corollary 21.8. Let $\mathcal{H}$ be any hypothesis class. Then, the standard optimal algorithm 
enjoys the mistake bound $M_{\mathrm{SOA}}(\mathcal{H}) = \mathrm{Ldim}(\mathcal{H})$ and no other algorithm can 
have $M_A(\mathcal{H}) < \mathrm{Ldim}(\mathcal{H})$. 
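
For a finite class over a finite domain, both Ldim and SOA can be computed by brute force. The sketch below rests on an observation that follows from the definition of a shattered tree: a tree of depth $k+1$ rooted at $x$ is shattered by a class exactly when the hypotheses labeling $x$ with 0 and those labeling it with 1 each shatter a tree of depth $k$. Representing a hypothesis as its tuple of labels, and assigning the empty class the value $-1$, are conventions assumed here for illustration; the running time is exponential, so this is only meant for tiny classes.

```python
from functools import lru_cache
from typing import FrozenSet, Sequence, Tuple

Hypothesis = Tuple[int, ...]  # labels of one hypothesis on the domain 0, ..., n-1


@lru_cache(maxsize=None)
def ldim(hypotheses: FrozenSet[Hypothesis]) -> int:
    """Littlestone dimension of a finite class (convention: -1 for the empty class)."""
    if not hypotheses:
        return -1
    n_points = len(next(iter(hypotheses)))
    best = 0
    for x in range(n_points):
        ones = frozenset(h for h in hypotheses if h[x] == 1)
        zeros = hypotheses - ones
        if ones and zeros:  # x can serve as the root of a shattered tree
            best = max(best, 1 + min(ldim(ones), ldim(zeros)))
    return best


def soa_mistakes(hypotheses: FrozenSet[Hypothesis],
                 examples: Sequence[Tuple[int, int]]) -> int:
    """Run SOA on a realizable sequence of (instance, label) pairs and count mistakes."""
    version_space = frozenset(hypotheses)
    mistakes = 0
    for x, y in examples:
        ones = frozenset(h for h in version_space if h[x] == 1)
        zeros = version_space - ones
        prediction = 1 if ldim(ones) >= ldim(zeros) else 0  # ties -> 1
        if prediction != y:
            mistakes += 1
        version_space = ones if y == 1 else zeros
    return mistakes


# Example 21.3 on a domain of size 6: h_j(x) = 1 iff x == j.
n = 6
singletons = frozenset(tuple(int(x == j) for x in range(n)) for j in range(n))
# A discretized version of Example 21.4: thresholds x < a on 7 grid points.
thresholds = frozenset(tuple(int(x < a) for x in range(7)) for a in range(8))

target = tuple(int(x == n - 1) for x in range(n))
examples = [(x, target[x]) for x in range(n)]
singleton_mistakes = soa_mistakes(singletons, examples)
```

The singleton class has Ldim 1 regardless of its size, whereas the eight discretized thresholds attain the upper bound $\log_2 8 = 3$ of Example 21.2.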


Comparison to VC Dimension 

In the PAC learning model, learnability is characterized by the VC dimension of 
the class $\mathcal{H}$. Recall that the VC dimension of a class $\mathcal{H}$ is the maximal number $d$ 
such that there are instances $x_1, \ldots, x_d$ that are shattered by $\mathcal{H}$. That is, for any 
sequence of labels $(y_1, \ldots, y_d) \in \{0,1\}^d$ there exists a hypothesis $h \in \mathcal{H}$ that gives 
exactly this sequence of labels. The following theorem relates the VC dimension to 
the Littlestone dimension. 


Theorem 21.9. For any class $\mathcal{H}$, $\mathrm{VCdim}(\mathcal{H}) \le \mathrm{Ldim}(\mathcal{H})$, and there are classes for 
which strict inequality holds. Furthermore, the gap can be arbitrarily large. 


Proof. We first prove that $\mathrm{VCdim}(\mathcal{H}) \le \mathrm{Ldim}(\mathcal{H})$. Suppose $\mathrm{VCdim}(\mathcal{H}) = d$ and 
let $x_1, \ldots, x_d$ be a shattered set. We now construct a complete binary tree of 
instances $v_1, \ldots, v_{2^d - 1}$, where all nodes at depth $i$ are set to be $x_i$ (see the following 
illustration): 

[Illustration: a complete binary tree with $x_1$ at the root, $x_2$ at both of its children, 
$x_3$ at all four grandchildren, and so on.] 

Now, the definition of a shattered set clearly implies that we got a valid shattered 
tree of depth $d$, and we conclude that $\mathrm{VCdim}(\mathcal{H}) \le \mathrm{Ldim}(\mathcal{H})$. To show that the gap 
can be arbitrarily large simply note that the class given in Example 21.4 has VC 
dimension of 1 whereas its Littlestone dimension is infinite. □ 


21.2 ONLINE CLASSIFICATION IN THE UNREALIZABLE CASE 


In the previous section we studied online learnability in the realizable case. We now 
consider the unrealizable case. Similarly to the agnostic PAC model, we no longer 
assume that all labels are generated by some $h^* \in \mathcal{H}$, but we require the learner to 
be competitive with the best fixed predictor from $\mathcal{H}$. This is captured by the regret 
of the algorithm, which measures how “sorry” the learner is, in retrospect, not to 
have followed the predictions of some hypothesis $h \in \mathcal{H}$. Formally, the regret of an 
algorithm $A$ relative to $h$ when running on a sequence of $T$ examples is defined as 

$$\mathrm{Regret}_A(h, T) = \sup_{(x_1,y_1),\ldots,(x_T,y_T)} \left[ \sum_{t=1}^{T} |p_t - y_t| - \sum_{t=1}^{T} |h(x_t) - y_t| \right], \qquad (21.1)$$ 

and the regret of the algorithm relative to a hypothesis class $\mathcal{H}$ is 

$$\mathrm{Regret}_A(\mathcal{H}, T) = \sup_{h \in \mathcal{H}} \mathrm{Regret}_A(h, T). \qquad (21.2)$$ 


We restate the learner’s goal as having the lowest possible regret relative to $\mathcal{H}$. An 
interesting question is whether we can derive an algorithm with low regret, meaning 
that $\mathrm{Regret}_A(\mathcal{H}, T)$ grows sublinearly with the number of rounds, $T$, which implies 
that the difference between the error rate of the learner and the best hypothesis in 
$\mathcal{H}$ tends to zero as $T$ goes to infinity. 

We first show that this is an impossible mission: no algorithm can obtain a 
sublinear regret bound even if $|\mathcal{H}| = 2$. Indeed, consider $\mathcal{H} = \{h_0, h_1\}$, where $h_0$ 
is the function that always returns 0 and $h_1$ is the function that always returns 1. An 
adversary can make the number of mistakes of any online algorithm be equal to $T$, 
by simply waiting for the learner’s prediction and then providing the opposite label 
as the true label. In contrast, for any sequence of true labels, $y_1, \ldots, y_T$, let $b$ be 
the majority of labels in $y_1, \ldots, y_T$; then the number of mistakes of $h_b$ is at most $T/2$. 
Therefore, the regret of any online algorithm might be at least $T - T/2 = T/2$, which 
is not sublinear in $T$. This impossibility result is attributed to Cover (Cover 1965). 
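
The adversary in this argument is simple enough to simulate. In the sketch below, the deterministic learner (`follow_the_majority`, which predicts the majority of past labels) is only an illustrative stand-in; any deterministic rule mapping the label history to a prediction in $\{0,1\}$ could be plugged in.

```python
from typing import Callable, List, Tuple


def follow_the_majority(past_labels: List[int]) -> int:
    """An example deterministic learner: predict the majority of past labels (ties -> 1)."""
    return 1 if 2 * sum(past_labels) >= len(past_labels) else 0


def play_against_adversary(learner: Callable[[List[int]], int],
                           rounds: int) -> Tuple[int, int]:
    """Return (learner mistakes, mistakes of the better of h_0 and h_1)."""
    labels: List[int] = []
    learner_mistakes = 0
    for _ in range(rounds):
        prediction = learner(labels)
        label = 1 - prediction  # the adversary answers with the opposite label
        learner_mistakes += int(prediction != label)
        labels.append(label)
    ones = sum(labels)
    # h_0 errs on every label 1, h_1 errs on every label 0
    best_constant_mistakes = min(ones, rounds - ones)
    return learner_mistakes, best_constant_mistakes


T = 100
learner_mistakes, best_mistakes = play_against_adversary(follow_the_majority, T)
regret = learner_mistakes - best_mistakes
```

Whatever deterministic learner is used, the first returned value equals the number of rounds and the second is at most half of it, so the regret is at least $T/2$.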

To sidestep Cover’s impossibility result, we must further restrict the power of the 
adversarial environment. We do so by allowing the learner to randomize his predictions. 
Of course, this by itself does not circumvent Cover’s impossibility result, since 
in deriving this result we assumed nothing about the learner’s strategy. To make the 
randomization meaningful, we force the adversarial environment to decide on $y_t$ 
without knowing the random coins flipped by the learner on round $t$. The adversary 
can still know the learner’s forecasting strategy and even the random coin flips of 
previous rounds, but it does not know the actual value of the random coin flips used 
by the learner on round $t$. With this (mild) change of game, we analyze the expected 
number of mistakes of the algorithm, where the expectation is with respect to the 
learner’s own randomization. That is, if the learner outputs $\hat{y}_t$, where $\mathbb{P}[\hat{y}_t = 1] = p_t$, 
then the expected loss he pays on round $t$ is 

$$\mathbb{P}[\hat{y}_t \ne y_t] = |p_t - y_t|.$$ 

Put another way, instead of having the predictions of the learner being in $\{0,1\}$ we 
allow them to be in $[0,1]$, and interpret $p_t \in [0,1]$ as the probability to predict the 
label 1 on round $t$. 

With this assumption it is possible to derive a low regret algorithm. In particular, 
we will prove the following theorem. 


Theorem 21.10. For every hypothesis class $\mathcal{H}$, there exists an algorithm for online 
classification, whose predictions come from $[0,1]$, that enjoys the regret bound 

$$\forall h \in \mathcal{H}, \quad \sum_{t=1}^{T} |p_t - y_t| - \sum_{t=1}^{T} |h(x_t) - y_t| \le \sqrt{2 \min\{\log(|\mathcal{H}|),\ \mathrm{Ldim}(\mathcal{H}) \log(eT)\}\, T}.$$ 

Furthermore, no algorithm can achieve an expected regret bound smaller than 
$\Omega\big(\sqrt{\mathrm{Ldim}(\mathcal{H})\, T}\big)$. 


We will provide a constructive proof of the upper bound part of the preceding 
theorem. The proof of the lower bound part can be found in (Ben-David, Pal, & 
Shalev-Shwartz 2009). 

The proof of Theorem 21.10 relies on the Weighted-Majority algorithm for learn- 
ing with expert advice. This algorithm is important by itself and we dedicate the next 
subsection to it. 


21.2.1 Weighted-Majority 


Weighted-majority is an algorithm for the problem of prediction with expert advice. 
In this online learning problem, on round $t$ the learner has to choose the advice 
of $d$ given experts. We also allow the learner to randomize his choice by defining 
a distribution over the $d$ experts, that is, picking a vector $w^{(t)} \in [0,1]^d$, with 
$\sum_i w_i^{(t)} = 1$, and choosing the $i$th expert with probability $w_i^{(t)}$. After the learner 
chooses an expert, it receives a vector of costs, $v_t \in [0,1]^d$, where $v_{t,i}$ is the cost of 
following the advice of the $i$th expert. If the learner’s predictions are randomized, 
then its loss is defined to be the averaged cost, namely, $\sum_i w_i^{(t)} v_{t,i} = \langle w^{(t)}, v_t \rangle$. The 
algorithm assumes that the number of rounds $T$ is given. In Exercise 21.4 we show 
how to get rid of this dependence using the doubling trick. 


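Weighted-Majority 


input: number of experts, $d$ ; number of rounds, $T$ 
parameter: $\eta = \sqrt{2 \log(d)/T}$ 
initialize: $\tilde{w}^{(1)} = (1, \ldots, 1)$ 
for $t = 1, 2, \ldots$ 
    set $w^{(t)} = \tilde{w}^{(t)}/Z_t$, where $Z_t = \sum_i \tilde{w}_i^{(t)}$ 
    choose expert $i$ at random according to $\mathbb{P}[i] = w_i^{(t)}$ 
    receive costs of all experts $v_t \in [0,1]^d$ 
    pay cost $\langle w^{(t)}, v_t \rangle$ 
    update rule $\forall i,\ \tilde{w}_i^{(t+1)} = \tilde{w}_i^{(t)} e^{-\eta v_{t,i}}$ 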


The following theorem is key for analyzing the regret bound of Weighted- 
Majority. 


Theorem 21.11. Assuming that $T > 2\log(d)$, the Weighted-Majority algorithm enjoys 
the bound 

$$\sum_{t=1}^{T} \langle w^{(t)}, v_t \rangle - \min_{i \in [d]} \sum_{t=1}^{T} v_{t,i} \le \sqrt{2 \log(d)\, T}.$$ 


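Proof. We have: 

$$\log\frac{Z_{t+1}}{Z_t} = \log \sum_i \frac{\tilde{w}_i^{(t)}}{Z_t}\, e^{-\eta v_{t,i}} = \log \sum_i w_i^{(t)} e^{-\eta v_{t,i}}.$$ 

Using the inequality $e^{-a} \le 1 - a + a^2/2$, which holds for all $a \in (0,1)$, and the fact 
that $\sum_i w_i^{(t)} = 1$, we obtain 

$$\log\frac{Z_{t+1}}{Z_t} \le \log \sum_i w_i^{(t)} \left(1 - \eta v_{t,i} + \eta^2 v_{t,i}^2/2\right) 
= \log\Big(1 - \underbrace{\sum_i w_i^{(t)} \left(\eta v_{t,i} - \eta^2 v_{t,i}^2/2\right)}_{\overset{\mathrm{def}}{=}\, b}\Big).$$ 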


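Next, note that $b \in (0,1)$. Therefore, taking log of the two sides of the inequality 
$1 - b \le e^{-b}$ we obtain the inequality $\log(1 - b) \le -b$, which holds for all $b \le 1$, 
and obtain 

$$\log\frac{Z_{t+1}}{Z_t} \le -\sum_i w_i^{(t)} \left(\eta v_{t,i} - \eta^2 v_{t,i}^2/2\right) 
= -\eta \langle w^{(t)}, v_t \rangle + \eta^2 \sum_i w_i^{(t)} v_{t,i}^2/2 
\le -\eta \langle w^{(t)}, v_t \rangle + \eta^2/2.$$ 

Summing this inequality over $t$ we get 

$$\log(Z_{T+1}) - \log(Z_1) = \sum_{t=1}^{T} \log\frac{Z_{t+1}}{Z_t} \le -\eta \sum_{t=1}^{T} \langle w^{(t)}, v_t \rangle + \frac{T\eta^2}{2}. \qquad (21.3)$$ 

Next, we lower bound $Z_{T+1}$. For each $i$, we can rewrite $\tilde{w}_i^{(T+1)} = e^{-\eta \sum_t v_{t,i}}$ and we 
get that 

$$\log Z_{T+1} = \log\Big(\sum_i e^{-\eta \sum_t v_{t,i}}\Big) \ge \log\Big(\max_i e^{-\eta \sum_t v_{t,i}}\Big) = -\eta \min_i \sum_t v_{t,i}.$$ 

Combining the preceding with Equation (21.3) and using the fact that $\log(Z_1) = 
\log(d)$ we get that 

$$-\eta \min_i \sum_t v_{t,i} - \log(d) \le -\eta \sum_{t=1}^{T} \langle w^{(t)}, v_t \rangle + \frac{T\eta^2}{2},$$ 

which can be rearranged as follows: 

$$\sum_{t=1}^{T} \langle w^{(t)}, v_t \rangle - \min_i \sum_t v_{t,i} \le \frac{\log(d)}{\eta} + \frac{\eta T}{2}.$$ 

Plugging the value of $\eta$ into the equation concludes our proof. □ 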
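
The algorithm and its guarantee can be checked on synthetic costs. The following is a minimal sketch under stated assumptions: the costs are drawn at random purely for illustration (the theorem holds for any cost sequence in $[0,1]^d$), and instead of sampling an expert the code accumulates the expected cost $\langle w^{(t)}, v_t \rangle$ directly.

```python
import math
import random
from typing import List, Sequence, Tuple


def weighted_majority(costs: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (cumulative expected cost of Weighted-Majority, cost of the best expert)."""
    T = len(costs)
    d = len(costs[0])
    eta = math.sqrt(2 * math.log(d) / T)
    w_tilde: List[float] = [1.0] * d  # unnormalized weights
    total_cost = 0.0
    for v in costs:
        z = sum(w_tilde)
        w = [wi / z for wi in w_tilde]  # distribution over experts
        total_cost += sum(wi * vi for wi, vi in zip(w, v))  # <w^(t), v_t>
        # multiplicative update: experts with high cost lose weight
        w_tilde = [wi * math.exp(-eta * vi) for wi, vi in zip(w_tilde, v)]
    best_expert_cost = min(sum(v[i] for v in costs) for i in range(d))
    return total_cost, best_expert_cost


rng = random.Random(0)
T, d = 500, 10  # satisfies T > 2 log(d)
costs = [[rng.random() for _ in range(d)] for _ in range(T)]
total_cost, best_expert_cost = weighted_majority(costs)
regret = total_cost - best_expert_cost
regret_bound = math.sqrt(2 * math.log(d) * T)
```

For very long horizons the unnormalized weights can underflow; renormalizing `w_tilde` after each update leaves the distribution $w^{(t)}$ unchanged and avoids this.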


Proof of Theorem 21.10 


Equipped with the Weighted-Majority algorithm and Theorem 21.11, we are ready to 
prove Theorem 21.10. We start with the simpler case, in which $\mathcal{H}$ is a finite class, 
and let us write $\mathcal{H} = \{h_1, \ldots, h_d\}$. In this case, we can refer to each hypothesis, $h_i$, 
as an expert, whose advice is to predict $h_i(x_t)$, and whose cost is $v_{t,i} = |h_i(x_t) - y_t|$. 
The prediction of the algorithm will therefore be $p_t = \sum_i w_i^{(t)} h_i(x_t) \in [0,1]$, and the 
loss is 

$$|p_t - y_t| = \Big|\sum_{i=1}^{d} w_i^{(t)} h_i(x_t) - y_t\Big| = \Big|\sum_{i=1}^{d} w_i^{(t)} \big(h_i(x_t) - y_t\big)\Big|.$$ 

Now, if $y_t = 1$, then for all $i$, $h_i(x_t) - y_t \le 0$. Therefore, the above equals 
$\sum_i w_i^{(t)} |h_i(x_t) - y_t|$. If $y_t = 0$ then for all $i$, $h_i(x_t) - y_t \ge 0$, and the above also equals 
$\sum_i w_i^{(t)} |h_i(x_t) - y_t|$. All in all, we have shown that 

$$|p_t - y_t| = \sum_{i=1}^{d} w_i^{(t)} |h_i(x_t) - y_t| = \langle w^{(t)}, v_t \rangle.$$ 

Furthermore, for each $i$, $\sum_t v_{t,i}$ is exactly the number of mistakes hypothesis $h_i$ 
makes. Applying Theorem 21.11 we obtain 


Corollary 21.12. Let $\mathcal{H}$ be a finite hypothesis class. There exists an algorithm for 
online classification, whose predictions come from $[0,1]$, that enjoys the regret bound 

$$\sum_{t=1}^{T} |p_t - y_t| - \min_{h \in \mathcal{H}} \sum_{t=1}^{T} |h(x_t) - y_t| \le \sqrt{2 \log(|\mathcal{H}|)\, T}.$$ 


Next, we consider the case of a general hypothesis class. Previously, we constructed 
an expert for each individual hypothesis. However, if $\mathcal{H}$ is infinite this leads 
to a vacuous bound. The main idea is to construct a set of experts in a more sophisticated 
way. The challenge is how to define a set of experts that, on one hand, is 
not excessively large and, on the other hand, contains experts that give accurate 
predictions. 

We construct the set of experts so that for each hypothesis $h \in \mathcal{H}$ and every 
sequence of instances, $x_1, x_2, \ldots, x_T$, there exists at least one expert in the set which 
behaves exactly as $h$ on these instances. For each $L \le \mathrm{Ldim}(\mathcal{H})$ and each sequence 
$1 \le i_1 < i_2 < \cdots < i_L \le T$ we define an expert. The expert simulates the game between 
SOA (presented in the previous section) and the environment on the sequence 
of instances $x_1, x_2, \ldots, x_T$ assuming that SOA makes a mistake precisely in rounds 
$i_1, i_2, \ldots, i_L$. The expert is defined by the following algorithm. 


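Expert$(i_1, i_2, \ldots, i_L)$ 


input: A hypothesis class $\mathcal{H}$ ; indices $i_1 < i_2 < \cdots < i_L$ 
initialize: $V_1 = \mathcal{H}$ 
for $t = 1, 2, \ldots, T$ 
    receive $x_t$ 
    for $r \in \{0,1\}$ let $V_t^{(r)} = \{h \in V_t : h(x_t) = r\}$ 
    define $\tilde{y}_t = \operatorname{argmax}_r \mathrm{Ldim}(V_t^{(r)})$ 
    (in case of a tie set $\tilde{y}_t = 0$) 
    if $t \in \{i_1, i_2, \ldots, i_L\}$ 
        predict $\hat{y}_t = 1 - \tilde{y}_t$ 
    else 
        predict $\hat{y}_t = \tilde{y}_t$ 
    update $V_{t+1} = V_t^{(\hat{y}_t)}$ 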


Note that each such expert can give us predictions at every round $t$ while only 
observing the instances $x_1, \ldots, x_t$. Our generic online learning algorithm is now an 
application of the Weighted-Majority algorithm with these experts. 

To analyze the algorithm we first note that the number of experts is 

$$d = \sum_{L=0}^{\mathrm{Ldim}(\mathcal{H})} \binom{T}{L}. \qquad (21.4)$$ 

It can be shown that when $T \ge \mathrm{Ldim}(\mathcal{H}) + 2$, the right-hand side of the equation is 
bounded by $(eT/\mathrm{Ldim}(\mathcal{H}))^{\mathrm{Ldim}(\mathcal{H})}$ (the proof can be found in Lemma A.5). 
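
A quick numerical check makes the size of the expert set concrete. The values of $T$ and of the Littlestone dimension below are arbitrary illustrative choices, and the function names are not from the text.

```python
import math


def num_experts(T: int, littlestone_dim: int) -> int:
    """Number of index sets i_1 < ... < i_L with L <= Ldim, i.e. sum_{L=0}^{Ldim} C(T, L)."""
    return sum(math.comb(T, L) for L in range(littlestone_dim + 1))


def expert_count_bound(T: int, littlestone_dim: int) -> float:
    """The bound (eT / Ldim)^Ldim, stated for T >= Ldim + 2."""
    return (math.e * T / littlestone_dim) ** littlestone_dim


T, littlestone_dim = 100, 3
d = num_experts(T, littlestone_dim)            # 1 + 100 + 4950 + 161700
bound = expert_count_bound(T, littlestone_dim)
# Regret term of Weighted-Majority run with these d experts.
regret_term = math.sqrt(2 * math.log(d) * T)
```

Since $\log d \le \mathrm{Ldim}(\mathcal{H}) \log(eT/\mathrm{Ldim}(\mathcal{H}))$, the number of experts enters the regret only through a factor that grows logarithmically in $T$.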

Theorem 21.11 tells us that the expected number of mistakes of Weighted-Majority 
is at most the number of mistakes of the best expert plus $\sqrt{2\log(d)\,T}$. We 
will next show that the number of mistakes of the best expert is at most the number 
of mistakes of the best hypothesis in $\mathcal{H}$. The following key lemma shows that, on 
any sequence of instances, for each hypothesis $h \in \mathcal{H}$ there exists an expert with the 
same behavior. 


Lemma 21.13. Let $\mathcal{H}$ be any hypothesis class with $\mathrm{Ldim}(\mathcal{H}) < \infty$. Let $x_1, x_2, \ldots, x_T$ 
be any sequence of instances. For any $h \in \mathcal{H}$, there exists $L \le \mathrm{Ldim}(\mathcal{H})$ and indices 
$1 \le i_1 < i_2 < \cdots < i_L \le T$ such that when running Expert$(i_1, i_2, \ldots, i_L)$ on the sequence 
$x_1, x_2, \ldots, x_T$, the expert predicts $h(x_t)$ on each online round $t = 1, 2, \ldots, T$. 

Proof. Fix $h \in \mathcal{H}$ and the sequence $x_1, x_2, \ldots, x_T$. We must construct $L$ and the 
indices $i_1, i_2, \ldots, i_L$. Consider running SOA on the input $(x_1, h(x_1)), (x_2, h(x_2)), \ldots, 
(x_T, h(x_T))$. SOA makes at most $\mathrm{Ldim}(\mathcal{H})$ mistakes on such input. We define $L$ to 
be the number of mistakes made by SOA and we define $\{i_1, i_2, \ldots, i_L\}$ to be the set 
of rounds in which SOA made the mistakes. 

Now, consider the Expert$(i_1, i_2, \ldots, i_L)$ running on the sequence $x_1, x_2, \ldots, x_T$. 
By construction, the set $V_t$ maintained by Expert$(i_1, i_2, \ldots, i_L)$ equals the set $V_t$ 
maintained by SOA when running on the sequence $(x_1, h(x_1)), \ldots, (x_T, h(x_T))$. The 
predictions of SOA differ from the predictions of $h$ if and only if the round is 
in $\{i_1, i_2, \ldots, i_L\}$. Since Expert$(i_1, i_2, \ldots, i_L)$ predicts exactly like SOA if $t$ is not 
in $\{i_1, i_2, \ldots, i_L\}$ and the opposite of SOA’s predictions if $t$ is in $\{i_1, i_2, \ldots, i_L\}$, we 
conclude that the predictions of the expert are always the same as the predictions 
of $h$. □ 


The previous lemma holds in particular for the hypothesis in $\mathcal{H}$ that makes the 
least number of mistakes on the sequence of examples, and we therefore obtain the 
following: 


Corollary 21.14. Let $(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$ be a sequence of examples and let 
$\mathcal{H}$ be a hypothesis class with $\mathrm{Ldim}(\mathcal{H}) < \infty$. There exists $L \le \mathrm{Ldim}(\mathcal{H})$ and indices 
$1 \le i_1 < i_2 < \cdots < i_L \le T$, such that Expert$(i_1, i_2, \ldots, i_L)$ makes at most as many 
mistakes as the best $h \in \mathcal{H}$ does, namely, 

$$\min_{h \in \mathcal{H}} \sum_{t=1}^{T} |h(x_t) - y_t|$$ 

mistakes on the sequence of examples. 


Together with Theorem 21.11, the upper bound part of Theorem 21.10 is proven. 


21.3 ONLINE CONVEX OPTIMIZATION 


In Chapter 12 we studied convex learning problems and showed learnability results 
for these problems in the agnostic PAC learning framework. In this section we 
show that similar learnability results hold for convex problems in the online learning 
framework. In particular, we consider the following problem. 


Online Convex Optimization 


definitions: 
    hypothesis class $\mathcal{H}$ ; domain $Z$ ; loss function $\ell : \mathcal{H} \times Z \to \mathbb{R}$ 
assumptions: 
    $\mathcal{H}$ is convex 
    $\forall z \in Z$, $\ell(\cdot, z)$ is a convex function 
for $t = 1, 2, \ldots, T$ 
    learner predicts a vector $w^{(t)} \in \mathcal{H}$ 
    environment responds with $z_t \in Z$ 
    learner suffers loss $\ell(w^{(t)}, z_t)$ 


As in the online classification problem, we analyze the regret of the algorithm. 
Recall that the regret of an online algorithm with respect to a competing hypothesis, 
which here will be some vector $w^* \in \mathcal{H}$, is defined as 

$$\mathrm{Regret}_A(w^*, T) = \sum_{t=1}^{T} \ell(w^{(t)}, z_t) - \sum_{t=1}^{T} \ell(w^*, z_t). \qquad (21.5)$$ 

As before, the regret of the algorithm relative to a set of competing vectors, $\mathcal{H}$, is 
defined as 

$$\mathrm{Regret}_A(\mathcal{H}, T) = \sup_{w^* \in \mathcal{H}} \mathrm{Regret}_A(w^*, T).$$ 


In Chapter 14 we have shown that Stochastic Gradient Descent solves convex 
learning problems in the agnostic PAC model. We now show that a very similar 
algorithm, Online Gradient Descent, solves online convex learning problems. 


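Online Gradient Descent 


parameter: $\eta > 0$ 
initialize: $w^{(1)} = 0$ 
for $t = 1, 2, \ldots, T$ 
    predict $w^{(t)}$ 
    receive $z_t$ and let $f_t(\cdot) = \ell(\cdot, z_t)$ 
    choose $v_t \in \partial f_t(w^{(t)})$ 
    update: 
        1. $w^{(t+\frac{1}{2})} = w^{(t)} - \eta v_t$ 
        2. $w^{(t+1)} = \operatorname{argmin}_{w \in \mathcal{H}} \|w - w^{(t+\frac{1}{2})}\|$ 


Theorem 21.15. The Online Gradient Descent algorithm enjoys the following regret 
bound for every $w^* \in \mathcal{H}$, 

$$\mathrm{Regret}_A(w^*, T) \le \frac{\|w^*\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2.$$ 

If we further assume that $f_t$ is $\rho$-Lipschitz for all $t$, then setting $\eta = 1/\sqrt{T}$ yields 

$$\mathrm{Regret}_A(w^*, T) \le \frac{1}{2}\left(\|w^*\|^2 + \rho^2\right)\sqrt{T}.$$ 

If we further assume that $\mathcal{H}$ is $B$-bounded and we set $\eta = \frac{B}{\rho\sqrt{T}}$ then 

$$\mathrm{Regret}_A(\mathcal{H}, T) \le B \rho \sqrt{T}.$$ 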


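Proof. The analysis is similar to the analysis of Stochastic Gradient Descent with 
projections. Using the projection lemma, the definition of $w^{(t+\frac{1}{2})}$, and the definition 
of subgradients, we have that for every $t$, 

$$\begin{aligned} 
\|w^{(t+1)} - w^*\|^2 - \|w^{(t)} - w^*\|^2 
&= \|w^{(t+1)} - w^*\|^2 - \|w^{(t+\frac{1}{2})} - w^*\|^2 + \|w^{(t+\frac{1}{2})} - w^*\|^2 - \|w^{(t)} - w^*\|^2 \\ 
&\le \|w^{(t+\frac{1}{2})} - w^*\|^2 - \|w^{(t)} - w^*\|^2 \\ 
&= \|w^{(t)} - \eta v_t - w^*\|^2 - \|w^{(t)} - w^*\|^2 \\ 
&= -2\eta \langle w^{(t)} - w^*, v_t \rangle + \eta^2 \|v_t\|^2 \\ 
&\le -2\eta \big(f_t(w^{(t)}) - f_t(w^*)\big) + \eta^2 \|v_t\|^2. 
\end{aligned}$$ 

Summing over $t$ and observing that the left-hand side is a telescopic sum we 
obtain that 

$$\|w^{(T+1)} - w^*\|^2 - \|w^{(1)} - w^*\|^2 \le -2\eta \sum_{t=1}^{T} \big(f_t(w^{(t)}) - f_t(w^*)\big) + \eta^2 \sum_{t=1}^{T} \|v_t\|^2.$$ 

Rearranging the inequality and using the fact that $w^{(1)} = 0$, we get that 

$$\begin{aligned} 
\sum_{t=1}^{T} \big(f_t(w^{(t)}) - f_t(w^*)\big) 
&\le \frac{\|w^{(1)} - w^*\|^2 - \|w^{(T+1)} - w^*\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2 \\ 
&\le \frac{\|w^*\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2. 
\end{aligned}$$ 

This proves the first bound in the theorem. The second bound follows from the 
assumption that $f_t$ is $\rho$-Lipschitz, which implies that $\|v_t\| \le \rho$. □ 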
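
The two-step update can be sketched in a few lines. The example below makes several assumptions for concreteness that are not part of the text: $\mathcal{H}$ is taken to be the Euclidean ball of radius $B$ (so the projection is a rescaling), the loss is the absolute loss $f_t(w) = |\langle w, x_t \rangle - y_t|$, whose subgradients have norm at most $\|x_t\|$, and the data are synthetic.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(ui * vi for ui, vi in zip(u, v))


def project_onto_ball(w: Vector, radius: float) -> Vector:
    """Euclidean projection onto {w : ||w|| <= radius}."""
    norm = math.sqrt(dot(w, w))
    return w if norm <= radius else [wi * radius / norm for wi in w]


def online_gradient_descent(examples: Sequence[Tuple[Vector, float]],
                            radius: float, lipschitz: float) -> List[Vector]:
    """Run OGD on f_t(w) = |<w, x_t> - y_t| and return the iterates w^(1), ..., w^(T)."""
    T = len(examples)
    eta = radius / (lipschitz * math.sqrt(T))  # step size from Theorem 21.15
    w: Vector = [0.0] * len(examples[0][0])
    iterates: List[Vector] = []
    for x, y in examples:
        iterates.append(w)
        residual = dot(w, x) - y
        sign = (residual > 0) - (residual < 0)
        v = [sign * xi for xi in x]                      # a subgradient of f_t at w
        half_step = [wi - eta * vi for wi, vi in zip(w, v)]
        w = project_onto_ball(half_step, radius)         # projection step
    return iterates


rng = random.Random(0)
w_star = [0.6, -0.8]  # comparator with ||w*|| = 1
examples = []
for _ in range(400):
    x = [rng.uniform(-1, 1) / math.sqrt(2) for _ in range(2)]  # ensures ||x|| <= 1
    examples.append((x, dot(w_star, x) + rng.uniform(-0.1, 0.1)))

B, rho = 1.0, 1.0
iterates = online_gradient_descent(examples, radius=B, lipschitz=rho)
learner_loss = sum(abs(dot(w, x) - y) for w, (x, y) in zip(iterates, examples))
comparator_loss = sum(abs(dot(w_star, x) - y) for x, y in examples)
regret = learner_loss - comparator_loss
```

The regret against this particular comparator is guaranteed to be at most $B\rho\sqrt{T}$; the bound does not depend on how the sequence was generated.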


21.4 THE ONLINE PERCEPTRON ALGORITHM 


The Perceptron is a classic online learning algorithm for binary classification with 
the hypothesis class of homogeneous halfspaces, namely, $\mathcal{H} = \{x \mapsto \mathrm{sign}(\langle w, x \rangle) : 
w \in \mathbb{R}^d\}$. In Section 9.1.2 we have presented the batch version of the Perceptron, 
which aims to solve the ERM problem with respect to $\mathcal{H}$. We now present an online 
version of the Perceptron algorithm. 

Let $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \{-1, 1\}$. On round $t$, the learner receives a vector $x_t \in \mathbb{R}^d$. The 
learner maintains a weight vector $w^{(t)} \in \mathbb{R}^d$ and predicts $p_t = \mathrm{sign}(\langle w^{(t)}, x_t \rangle)$. Then, 
it receives $y_t \in \mathcal{Y}$ and pays 1 if $p_t \ne y_t$ and 0 otherwise. 

The goal of the learner is to make as few prediction mistakes as possible. In 
Section 21.1 we characterized the optimal algorithm and showed that the best 
achievable mistake bound depends on the Littlestone dimension of the class. We 
show next that if $d \ge 2$ then $\mathrm{Ldim}(\mathcal{H}) = \infty$, which implies that we have no hope 
of making few prediction mistakes. Indeed, consider the tree for which $v_1 = 
(\tfrac{1}{2}, 1, 0, \ldots, 0)$, $v_2 = (\tfrac{1}{4}, 1, 0, \ldots, 0)$, $v_3 = (\tfrac{3}{4}, 1, 0, \ldots, 0)$, etc. Because of the density 
of the reals, this tree is shattered by the subset of $\mathcal{H}$ which contains all hypotheses 
that are parametrized by $w$ of the form $w = (-1, a, 0, \ldots, 0)$, for $a \in [0,1]$. We 
conclude that indeed $\mathrm{Ldim}(\mathcal{H}) = \infty$. 

To sidestep this impossibility result, the Perceptron algorithm relies on the tech- 
nique of surrogate convex losses (see Section 12.3). This is also closely related to the 
notion of margin we studied in Chapter 15. 

A weight vector $w$ makes a mistake on an example $(x, y)$ whenever the sign of 
$\langle w, x \rangle$ does not equal $y$. Therefore, we can write the 0−1 loss function as follows: 

$$\ell(w, (x, y)) = \mathbb{1}_{[y \langle w, x \rangle \le 0]}.$$ 

On rounds on which the algorithm makes a prediction mistake, we shall use the 
hinge-loss as a surrogate convex loss function 

$$f_t(w) = \max\{0,\ 1 - y_t \langle w, x_t \rangle\}.$$ 

The hinge-loss satisfies the two conditions: 

• $f_t$ is a convex function 
• For all $w$, $f_t(w) \ge \ell(w, (x_t, y_t))$. In particular, this holds for $w^{(t)}$. 

On rounds on which the algorithm is correct, we shall define $f_t(w) = 0$. Clearly, $f_t$ 
is convex in this case as well. Furthermore, $f_t(w^{(t)}) = \ell(w^{(t)}, (x_t, y_t)) = 0$. 


Remark 21.5. In Section 12.3 we used the same surrogate loss function for all the 
examples. In the online model, we allow the surrogate to depend on the specific 
round. It can even depend on $w^{(t)}$. Our ability to use a round-specific surrogate 
stems from the worst-case type of analysis we employ in online learning. 


Let us now run the Online Gradient Descent algorithm on the sequence of functions, 
$f_1, \ldots, f_T$, with the hypothesis class being all vectors in $\mathbb{R}^d$ (hence, the 
projection step is vacuous). Recall that the algorithm initializes $w^{(1)} = 0$ and its 
update rule is 

$$w^{(t+1)} = w^{(t)} - \eta v_t$$ 

for some $v_t \in \partial f_t(w^{(t)})$. In our case, if $y_t \langle w^{(t)}, x_t \rangle > 0$ then $f_t$ is the zero function and 
we can take $v_t = 0$. Otherwise, it is easy to verify that $v_t = -y_t x_t$ is in $\partial f_t(w^{(t)})$. We 
therefore obtain the update rule 

$$w^{(t+1)} = \begin{cases} w^{(t)} & \text{if } y_t \langle w^{(t)}, x_t \rangle > 0 \\ w^{(t)} + \eta y_t x_t & \text{otherwise} \end{cases}$$ 

Denote by $M$ the set of rounds in which $\mathrm{sign}(\langle w^{(t)}, x_t \rangle) \ne y_t$. Note that on round $t$, 
the prediction of the Perceptron can be rewritten as 

$$p_t = \mathrm{sign}(\langle w^{(t)}, x_t \rangle) = \mathrm{sign}\Big(\eta \sum_{i \in M : i < t} y_i \langle x_i, x_t \rangle\Big).$$ 

This form implies that the predictions of the Perceptron algorithm and the set $M$ 
do not depend on the actual value of $\eta$ as long as $\eta > 0$. We have therefore obtained 
the Perceptron algorithm: 


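Perceptron 


initialize: $w^{(1)} = 0$ 
for $t = 1, 2, \ldots, T$ 
    receive $x_t$ 
    predict $p_t = \mathrm{sign}(\langle w^{(t)}, x_t \rangle)$ 
    if $y_t \langle w^{(t)}, x_t \rangle \le 0$ 
        $w^{(t+1)} = w^{(t)} + y_t x_t$ 
    else 
        $w^{(t+1)} = w^{(t)}$ 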


To analyze the Perceptron, we rely on the analysis of Online Gradient Descent 
given in the previous section. In our case, the subgradient of $f_t$ we use in the 
Perceptron is $v_t = -\mathbb{1}_{[y_t \langle w^{(t)}, x_t \rangle \le 0]}\, y_t x_t$. Indeed, the Perceptron’s update is $w^{(t+1)} = 
w^{(t)} - v_t$, and as discussed before this is equivalent to $w^{(t+1)} = w^{(t)} - \eta v_t$ for every 
$\eta > 0$. Therefore, Theorem 21.15 tells us that 

$$\sum_{t=1}^{T} f_t(w^{(t)}) - \sum_{t=1}^{T} f_t(w^*) \le \frac{1}{2\eta} \|w^*\|_2^2 + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|_2^2.$$ 

Since $f_t(w^{(t)})$ is a surrogate for the 0−1 loss we know that $\sum_{t=1}^{T} f_t(w^{(t)}) \ge |M|$. 
Denote $R = \max_t \|x_t\|$; then we obtain 

$$|M| - \sum_{t=1}^{T} f_t(w^*) \le \frac{1}{2\eta} \|w^*\|_2^2 + \frac{\eta}{2} |M| R^2.$$ 

Setting $\eta = \frac{\|w^*\|}{R\sqrt{|M|}}$ and rearranging, we obtain 

$$|M| - R \|w^*\| \sqrt{|M|} - \sum_{t=1}^{T} f_t(w^*) \le 0. \qquad (21.6)$$ 


This inequality implies 

Theorem 21.16. Suppose that the Perceptron algorithm runs on a sequence 
$(x_1, y_1), \ldots, (x_T, y_T)$ and let $R = \max_t \|x_t\|$. Let $M$ be the rounds on which the 
Perceptron errs and let $f_t(w) = \mathbb{1}_{[t \in M]}\, [1 - y_t \langle w, x_t \rangle]_+$. Then, for every $w^*$ 

$$|M| \le \sum_t f_t(w^*) + R \|w^*\| \sqrt{\sum_t f_t(w^*)} + R^2 \|w^*\|^2.$$ 

In particular, if there exists $w^*$ such that $y_t \langle w^*, x_t \rangle \ge 1$ for all $t$ then 

$$|M| \le R^2 \|w^*\|^2.$$ 


Proof. The theorem follows from Equation (21.6) and the following claim: Given 
$x, b, c \in \mathbb{R}_+$, the inequality $x - b\sqrt{x} - c \le 0$ implies that $x \le c + b^2 + b\sqrt{c}$. The last 
claim can be easily derived by analyzing the roots of the convex parabola $Q(y) = 
y^2 - by - c$. □ 


The last assumption of Theorem 21.16 is called separability with large margin 
(see Chapter 15). That is, there exists $w^*$ that not only satisfies that the point $x_t$ lies 
on the correct side of the halfspace, it also guarantees that $x_t$ is not too close to the 
decision boundary. More specifically, the distance from $x_t$ to the decision boundary 
is at least $\gamma = 1/\|w^*\|$ and the bound becomes $(R/\gamma)^2$. 

When the separability assumption does not hold, the bound involves the term 
$[1 - y_t \langle w^*, x_t \rangle]_+$, which measures how much the separability with margin requirement 
is violated. 

As a last remark we note that there can be cases in which there exists some $w^*$ 
that makes zero errors on the sequence but the Perceptron will make many errors. 
Indeed, this is a direct consequence of the fact that $\mathrm{Ldim}(\mathcal{H}) = \infty$. The way we 
sidestep this impossibility result is by assuming more on the sequence of examples: 
the bound in Theorem 21.16 will be meaningful only if the cumulative surrogate 
loss, $\sum_t f_t(w^*)$, is not excessively large. 
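
The separable case of Theorem 21.16 can be checked empirically. The sketch below runs the Perceptron on synthetic two-dimensional data that is separable with margin by construction; the data generator, the margin threshold 0.25, and the witness $w^* = (4, 0)$ are illustrative assumptions.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(ui * vi for ui, vi in zip(u, v))


def perceptron(examples: Sequence[Tuple[Vector, int]]) -> Tuple[Vector, List[int]]:
    """Run the online Perceptron; return the final weights and the rounds with an update."""
    w: Vector = [0.0] * len(examples[0][0])
    mistake_rounds: List[int] = []
    for t, (x, y) in enumerate(examples, start=1):
        if y * dot(w, x) <= 0:  # prediction mistake (a zero margin counts as one)
            mistake_rounds.append(t)
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, mistake_rounds


rng = random.Random(1)
examples: List[Tuple[Vector, int]] = []
while len(examples) < 200:
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    if abs(x[0]) >= 0.25:  # keep only points away from the boundary x[0] = 0
        examples.append((x, 1 if x[0] > 0 else -1))

w_star = [4.0, 0.0]  # y * <w*, x> = 4 * |x[0]| >= 1 on every kept example
R_squared = max(dot(x, x) for x, _ in examples)
mistake_bound = R_squared * dot(w_star, w_star)  # R^2 * ||w*||^2

w_final, mistake_rounds = perceptron(examples)
```

The number of updates is bounded by $R^2 \|w^*\|^2$ regardless of the order in which the examples arrive or of the length of the sequence.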


21.5 SUMMARY 


In this chapter we have studied the online learning model. Many of the results we 
derived for the PAC learning model have an analog in the online model. First, we 
have shown that a combinatorial dimension, the Littlestone dimension, character- 
izes online learnability. To show this, we introduced the SOA algorithm (for the 
realizable case) and the Weighted-Majority algorithm (for the unrealizable case). 
We have also studied online convex optimization and have shown that online gradi- 
ent descent is a successful online learner whenever the loss function is convex and 
Lipschitz. Finally, we presented the online Perceptron algorithm as a combination 
of online gradient descent and the concept of surrogate convex loss functions. 


21.6 BIBLIOGRAPHIC REMARKS 


The Standard Optimal Algorithm was derived by the seminal work of Littlestone 
(1988). A generalization to the nonrealizable case, as well as other variants like 
margin-based Littlestone’s dimension, were derived in (Ben-David et al. 2009). 
Characterizations of online learnability beyond classification have been obtained 
in (Abernethy, Bartlett, Rakhlin & Tewari 2008, Rakhlin, Sridharan & Tewari 
2010, Daniely et al. 2011). The Weighted-Majority algorithm is due to (Littlestone 
& Warmuth 1994) and (Vovk 1990). 

The term “online convex programming” was introduced by Zinkevich (2003) 
but this setting was introduced some years earlier by Gordon (1999). The Percep- 
tron dates back to Rosenblatt (Rosenblatt 1958). An analysis for the realizable 
case (with margin assumptions) appears in (Agmon 1954, Minsky & Papert 1969). 
Freund and Schapire (Freund & Schapire 1999) presented an analysis for the 
unrealizable case with a squared-hinge-loss based on a reduction to the realizable 
case. A direct analysis for the unrealizable case with the hinge-loss was given by 
Gentile (Gentile 2003). 

For additional information we refer the reader to Cesa-Bianchi and Lugosi 
(2006) and Shalev-Shwartz (2011). 


21.7 EXERCISES 


21.1 Find a hypothesis class H and a sequence of examples on which Consistent makes 
|H| — 1 mistakes. 

21.2 Find a hypothesis class H and a sequence of examples on which the mistake bound 
of the Halving algorithm is tight. 

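21.3 Let $d \ge 2$, $\mathcal{X} = \{1,\ldots,d\}$ and let $\mathcal{H} = \{h_j : j \in [d]\}$, where $h_j(x) = \mathbb{1}_{[x=j]}$. Calculate
$M_{\mathrm{Halving}}(\mathcal{H})$ (i.e., derive lower and upper bounds on $M_{\mathrm{Halving}}(\mathcal{H})$, and prove that
they are equal).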

21.4 The Doubling Trick: 

In Theorem 21.15, the parameter $\eta$ depends on the time horizon $T$. In this exercise
we show how to get rid of this dependence by a simple trick. 

Consider an algorithm that enjoys a regret bound of the form $\alpha\sqrt{T}$, but its
parameters require the knowledge of T. The doubling trick, described in the follow- 
ing, enables us to convert such an algorithm into an algorithm that does not need 
to know the time horizon. The idea is to divide the time into periods of increasing 
size and run the original algorithm on each period. 


The Doubling Trick 


input: algorithm A whose parameters depend on the time horizon 


for m = 0, 1, 2, ...
    run A on the $2^m$ rounds $t = 2^m, \ldots, 2^{m+1}-1$


Show that if the regret of A on each period of $2^m$ rounds is at most $\alpha\sqrt{2^m}$, then the
total regret is at most

$$\frac{\sqrt{2}}{\sqrt{2}-1}\,\alpha\sqrt{T}.$$
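As a numerical sanity check of the bound in Exercise 21.4 (not a proof), the following minimal Python sketch enumerates the doubling periods up to a horizon $T$ and sums the assumed per-period bounds $\alpha\sqrt{2^m}$. The function names `doubling_periods`, `total_regret_bound`, and `claimed_bound` are ours and are introduced only for illustration; the last period is truncated at $T$, but it is still charged its full bound.

```python
import math

def doubling_periods(T):
    """Yield (m, first_round, last_round) for the periods that cover rounds 1..T."""
    m = 0
    while 2 ** m <= T:
        start = 2 ** m
        end = min(2 ** (m + 1) - 1, T)  # the last period may be cut short by T
        yield m, start, end
        m += 1

def total_regret_bound(T, alpha=1.0):
    """Sum the per-period bounds alpha * sqrt(2^m) over all periods started by round T."""
    return sum(alpha * math.sqrt(2 ** m) for m, _, _ in doubling_periods(T))

def claimed_bound(T, alpha=1.0):
    """The bound stated in the exercise: sqrt(2) / (sqrt(2) - 1) * alpha * sqrt(T)."""
    return math.sqrt(2) / (math.sqrt(2) - 1) * alpha * math.sqrt(T)
```

Comparing `total_regret_bound(T)` with `claimed_bound(T)` for a range of horizons shows the geometric sum staying below the claimed value, which is the computation the exercise asks the reader to carry out analytically.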
21.5 Online-to-batch Conversions: In this exercise we demonstrate how a successful
online learning algorithm can be used to derive a successful PAC learner as well.
Consider a PAC learning problem for binary classification parameterized by an
instance domain, $\mathcal{X}$, and a hypothesis class, $\mathcal{H}$. Suppose that there exists an online
learning algorithm, $A$, which enjoys a mistake bound $M_A(\mathcal{H}) < \infty$. Consider running
this algorithm on a sequence of $T$ examples which are sampled i.i.d. from a
distribution $\mathcal{D}$ over the instance space $\mathcal{X}$, and are labeled by some $h^\star \in \mathcal{H}$. Suppose
that for every round $t$, the prediction of the algorithm is based on a hypothesis
$h_t : \mathcal{X} \to \{0,1\}$. Show that

$$\mathbb{E}[L_{\mathcal{D}}(h_r)] \le \frac{M_A(\mathcal{H})}{T},$$

where the expectation is over the random choice of the instances as well as a random
choice of $r$ according to the uniform distribution over $[T]$.
Hint: Use similar arguments to the ones appearing in the proof of Theorem 14.8.


Clustering is one of the most widely used techniques for exploratory data analysis.
Across all disciplines, from social sciences to biology to computer science, people 
try to get a first intuition about their data by identifying meaningful groups among 
the data points. For example, computational biologists cluster genes on the basis of 
similarities in their expression in different experiments; retailers cluster customers, 
on the basis of their customer profiles, for the purpose of targeted marketing; and 
astronomers cluster stars on the basis of their spatial proximity.

The first point that one should clarify is, naturally, what is clustering? Intuitively, 
clustering is the task of grouping a set of objects such that similar objects end up in 
the same group and dissimilar objects are separated into different groups. Clearly, 
this description is quite imprecise and possibly ambiguous. Quite surprisingly, it is 
not at all clear how to come up with a more rigorous definition. 

There are several sources for this difficulty. One basic problem is that the two 
objectives mentioned in the earlier statement may in many cases contradict each 
other. Mathematically speaking, similarity (or proximity) is not a transitive relation, 
while cluster sharing is an equivalence relation and, in particular, it is a transitive 
relation. More concretely, it may be the case that there is a long sequence of objects, 
$x_1,\ldots,x_m$ such that each $x_i$ is very similar to its two neighbors, $x_{i-1}$ and $x_{i+1}$, but $x_1$
and $x_m$ are very dissimilar. If we wish to make sure that whenever two elements
are similar they share the same cluster, then we must put all of the elements of
the sequence in the same cluster. However, in that case, we end up with dissimilar
elements ($x_1$ and $x_m$) sharing a cluster, thus violating the second requirement.

To illustrate this point further, suppose that we would like to cluster the points 
in the following picture into two clusters. 


A clustering algorithm that emphasizes not separating close-by points (e.g., the
Single Linkage algorithm that will be described in Section 22.1) will cluster this input
by separating it horizontally according to the two lines:


In contrast, a clustering method that emphasizes not having far-away points share 
the same cluster (e.g., the 2-means algorithm that will be described in Section 22.2)
will cluster the same input by dividing it vertically into the right-hand half and the 
left-hand half: 


Another basic problem is the lack of “ground truth” for clustering, which is a 
common problem in unsupervised learning. So far in the book, we have mainly dealt 
with supervised learning (e.g., the problem of learning a classifier from labeled train- 
ing data). The goal of supervised learning is clear — we wish to learn a classifier 
which will predict the labels of future examples as accurately as possible. Further- 
more, a supervised learner can estimate the success, or the risk, of its hypotheses 
using the labeled training data by computing the empirical loss. In contrast, clus- 
tering is an unsupervised learning problem; namely, there are no labels that we 
try to predict. Instead, we wish to organize the data in some meaningful way. 
As a result, there is no clear success evaluation procedure for clustering. In fact, 
even on the basis of full knowledge of the underlying data distribution, it is not 
clear what is the “correct” clustering for that data or how to evaluate a proposed 
clustering. 

Consider, for example, the following set of points in $\mathbb{R}^2$:
and suppose we are required to cluster them into two clusters. We have two highly
justifiable solutions: 


This phenomenon is not just artificial but occurs in real applications. A given set 
of objects can be clustered in various different meaningful ways. This may be due 
to having different implicit notions of distance (or similarity) between objects, for 
example, clustering recordings of speech by the accent of the speaker versus clus- 
tering them by content, clustering movie reviews by movie topic versus clustering 
them by the review sentiment, clustering paintings by topic versus clustering them 
by style, and so on. 

To summarize, there may be several very different conceivable clustering solu- 
tions for a given data set. As a result, there is a wide variety of clustering algorithms 
that, on some input data, will output very different clusterings. 


A Clustering Model: 

Clustering tasks can vary in terms of both the type of input they have and the type 
of outcome they are expected to compute. For concreteness, we shall focus on the 
following common setup: 


Input - a set of elements, $\mathcal{X}$, and a distance function over it. That is, a function
$d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ that is symmetric, satisfies $d(x,x) = 0$ for all $x \in \mathcal{X}$, and often
also satisfies the triangle inequality. Alternatively, the function could be a
similarity function $s : \mathcal{X} \times \mathcal{X} \to [0,1]$ that is symmetric and satisfies $s(x,x) = 1$
for all $x \in \mathcal{X}$. Additionally, some clustering algorithms also require an input
parameter $k$ (determining the number of required clusters).

Output - a partition of the domain set $\mathcal{X}$ into subsets. That is, $C = (C_1,\ldots,C_k)$
where $\bigcup_{i=1}^k C_i = \mathcal{X}$ and for all $i \ne j$, $C_i \cap C_j = \emptyset$. In some situations the
clustering is “soft,” namely, the partition of $\mathcal{X}$ into the different clusters is
probabilistic where the output is a function assigning to each domain point,
$x \in \mathcal{X}$, a vector $(p_1(x),\ldots,p_k(x))$, where $p_i(x) = \mathbb{P}[x \in C_i]$ is the probability
that $x$ belongs to cluster $C_i$. Another possible output is a clustering dendrogram
(from Greek dendron = tree, gramma = drawing), which is a hierarchical
tree of domain subsets, having the singleton sets in its leaves, and the full
domain as its root. We shall discuss this formulation in more detail in the
following.


In the following we survey some of the most popular clustering methods. In the 
last section of this chapter we return to the high level discussion of what is clustering. 


22.1 LINKAGE-BASED CLUSTERING ALGORITHMS 


Linkage-based clustering is probably the simplest and most straightforward 
paradigm of clustering. These algorithms proceed in a sequence of rounds. They
start from the trivial clustering that has each data point as a single-point cluster.
Then, repeatedly, these algorithms merge the “closest” clusters of the previous clus- 
tering. Consequently, the number of clusters decreases with each such round. If kept 
going, such algorithms would eventually result in the trivial clustering in which all of 
the domain points share one large cluster. Two parameters, then, need to be deter- 
mined to define such an algorithm clearly. First, we have to decide how to measure 
(or define) the distance between clusters, and, second, we have to determine when 
to stop merging. Recall that the input to a clustering algorithm is a between-points 
distance function, d. There are many ways of extending d to a measure of distance 
between domain subsets (or clusters). The most common ways are 


1. Single Linkage clustering, in which the between-clusters distance is defined
by the minimum distance between members of the two clusters, namely,

$$D(A,B) \stackrel{\mathrm{def}}{=} \min\{d(x,y) : x \in A,\ y \in B\}.$$

2. Average Linkage clustering, in which the distance between two clusters is
defined to be the average distance between a point in one of the clusters and
a point in the other, namely,

$$D(A,B) \stackrel{\mathrm{def}}{=} \frac{1}{|A||B|} \sum_{x \in A,\ y \in B} d(x,y).$$

3. Max Linkage clustering, in which the distance between two clusters is defined
as the maximum distance between their elements, namely,

$$D(A,B) \stackrel{\mathrm{def}}{=} \max\{d(x,y) : x \in A,\ y \in B\}.$$


The linkage-based clustering algorithms are agglomerative in the sense that they 
start from data that is completely fragmented and keep building larger and larger 
clusters as they proceed. Without employing a stopping rule, the outcome of such 
an algorithm can be described by a clustering dendrogram: that is, a tree of domain 
subsets, having the singleton sets in its leaves, and the full domain as its root. For 
example, if the input is the elements $\mathcal{X} = \{a,b,c,d,e\} \subset \mathbb{R}^2$ with the Euclidean
distance as depicted on the left, then the resulting dendrogram is the one depicted on
the right:

[Figure: the points a, b, c, d, e in the plane (left) and the resulting dendrogram (right)]

              {a, b, c, d, e}
              /             \
             /          {b, c, d, e}
            /            /        \
           /         {b, c}      {d, e}
          /          /    \      /    \
        {a}        {b}    {c}  {d}    {e}


The single linkage algorithm is closely related to Kruskal’s algorithm for finding 
a minimal spanning tree on a weighted graph. Indeed, consider the full graph whose 
vertices are elements of $\mathcal{X}$ and the weight of an edge $(x, y)$ is the distance $d(x, y)$.
Each merge of two clusters performed by the single linkage algorithm corresponds
to a choice of an edge in the aforementioned graph. It is also possible to show that
the set of edges the single linkage algorithm chooses along its run forms a minimal
spanning tree. 

If one wishes to turn a dendrogram into a partition of the space (a clustering), 
one needs to employ a stopping criterion. Common stopping criteria include 


■ Fixed number of clusters - fix some parameter, $k$, and stop merging clusters as
soon as the number of clusters is $k$.

■ Distance upper bound - fix some $r \in \mathbb{R}_+$. Stop merging as soon as all
the between-clusters distances are larger than $r$. We can also set $r$ to be
$\alpha \max\{d(x,y) : x,y \in \mathcal{X}\}$ for some $\alpha < 1$. In that case the stopping criterion is
called “scaled distance upper bound.”
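The linkage rules and stopping criteria of this section can be sketched in a few lines of Python. This is a minimal, unoptimized sketch rather than a reference implementation: the function names `linkage_distance` and `linkage_clustering`, the `kind` strings, and the convention that ties are broken by taking the first closest pair are all our own assumptions.

```python
from itertools import combinations

def linkage_distance(A, B, d, kind):
    """Between-cluster distance D(A, B) for the three linkage rules in the text."""
    pair_distances = [d(x, y) for x in A for y in B]
    if kind == "single":
        return min(pair_distances)
    if kind == "max":
        return max(pair_distances)
    if kind == "average":
        return sum(pair_distances) / (len(A) * len(B))
    raise ValueError(f"unknown linkage: {kind}")

def linkage_clustering(points, d, kind="single", k=None, r=None):
    """Agglomerative clustering with the two stopping rules from the text.

    k: stop as soon as the number of clusters is k.
    r: stop as soon as all between-cluster distances are larger than r.
    With neither rule, everything is merged into one cluster.
    """
    clusters = [[p] for p in points]  # start from single-point clusters
    while len(clusters) > 1:
        if k is not None and len(clusters) <= k:
            break
        # Find the closest pair of clusters under the chosen linkage.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], d, kind),
        )
        if r is not None and linkage_distance(clusters[i], clusters[j], d, kind) > r:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters
```

For instance, on the points 0, 1, 2, 10, 11 of the real line with $d(x,y) = |x - y|$, calling `linkage_clustering(points, d, kind="single", k=2)` separates the three small values from the two large ones. Recomputing all between-cluster distances in every round keeps the sketch short at the cost of efficiency.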


22.2 k-MEANS AND OTHER COST MINIMIZATION CLUSTERINGS 


Another popular approach to clustering starts by defining a cost function over a 
parameterized set of possible clusterings and the goal of the clustering algorithm is 
to find a partitioning (clustering) of minimal cost. Under this paradigm, the cluster- 
ing task is turned into an optimization problem. The objective function is a function 
from pairs of an input, $(\mathcal{X},d)$, and a proposed clustering solution $C = (C_1,\ldots,C_k)$,
to positive real numbers. Given such an objective function, which we denote by $G$,
the goal of a clustering algorithm is defined as finding, for a given input $(\mathcal{X},d)$, a
clustering $C$ so that $G((\mathcal{X},d), C)$ is minimized. In order to reach that goal, one has
to apply some appropriate search algorithm. 

As it turns out, most of the resulting optimization problems are NP-hard, and 
some are even NP-hard to approximate. Consequently, when people talk about, 
say, k-means clustering, they often refer to some particular common approximation 
algorithm rather than the cost function or the corresponding exact solution of the 
minimization problem. 

Many common objective functions require the number of clusters, k, as a param- 
eter. In practice, it is often up to the user of the clustering algorithm to choose the 
parameter k that is most suitable for the given clustering problem. 

In the following we describe some of the most common objective functions. 


The k-means objective function is one of the most popular clustering objectives.
In k-means the data is partitioned into disjoint sets $C_1,\ldots,C_k$ where each $C_i$ is
represented by a centroid $\mu_i$. It is assumed that the input set $\mathcal{X}$ is embedded in
some larger metric space $(\mathcal{X}',d)$ (so that $\mathcal{X} \subseteq \mathcal{X}'$) and centroids are members
of $\mathcal{X}'$. The k-means objective function measures the squared distance between
each point in $\mathcal{X}$ to the centroid of its cluster. The centroid of $C_i$ is defined to be

$$\mu_i(C_i) = \operatorname*{argmin}_{\mu \in \mathcal{X}'} \sum_{x \in C_i} d(x,\mu)^2.$$

Then, the k-means objective is

$$G_{k\text{-means}}((\mathcal{X},d),(C_1,\ldots,C_k)) = \sum_{i=1}^{k} \sum_{x \in C_i} d(x,\mu_i(C_i))^2.$$

This can also be rewritten as

$$G_{k\text{-means}}((\mathcal{X},d),(C_1,\ldots,C_k)) = \min_{\mu_1,\ldots,\mu_k \in \mathcal{X}'} \sum_{i=1}^{k} \sum_{x \in C_i} d(x,\mu_i)^2. \qquad (22.1)$$


The k-means objective function is relevant, for example, in digital communication
tasks, where the members of $\mathcal{X}$ may be viewed as a collection of signals
that have to be transmitted. While $\mathcal{X}$ may be a very large set of real valued vectors,
digital transmission allows transmitting of only a finite number of bits for
each signal. One way to achieve good transmission under such constraints is to
represent each member of $\mathcal{X}$ by a “close” member of some finite set $\mu_1,\ldots,\mu_k$,
and replace the transmission of any $x \in \mathcal{X}$ by transmitting the index of the
closest $\mu_i$. The k-means objective can be viewed as a measure of the distortion
created by such a transmission representation scheme.

The k-medoids objective function is similar to the k-means objective, except that
it requires the cluster centroids to be members of the input set. The objective
function is defined by

$$G_{k\text{-medoid}}((\mathcal{X},d),(C_1,\ldots,C_k)) = \min_{\mu_1,\ldots,\mu_k \in \mathcal{X}} \sum_{i=1}^{k} \sum_{x \in C_i} d(x,\mu_i)^2.$$

The k-median objective function is quite similar to the k-medoids objective,
except that the “distortion” between a data point and the centroid of its cluster
is measured by distance, rather than by the square of the distance:

$$G_{k\text{-median}}((\mathcal{X},d),(C_1,\ldots,C_k)) = \min_{\mu_1,\ldots,\mu_k \in \mathcal{X}} \sum_{i=1}^{k} \sum_{x \in C_i} d(x,\mu_i).$$


An example where such an objective makes sense is the facility location prob- 
lem. Consider the task of locating k fire stations in a city. One can model 
houses as data points and aim to place the stations so as to minimize the 
average distance between a house and its closest fire station. 


The previous examples can all be viewed as center-based objectives. The solu- 
tion to such a clustering problem is determined by a set of cluster centers, and the 
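clustering assigns each instance to the center closest to it. More generally, a
center-based objective is determined by choosing some monotonic function $f : \mathbb{R}_+ \to \mathbb{R}_+$
and then defining

$$G_f((\mathcal{X},d),(C_1,\ldots,C_k)) = \min_{\mu_1,\ldots,\mu_k \in \mathcal{X}'} \sum_{i=1}^{k} \sum_{x \in C_i} f(d(x,\mu_i)),$$

where $\mathcal{X}'$ is either $\mathcal{X}$ or some superset of $\mathcal{X}$.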
Some objective functions are not center based. For example, the sum of in-cluster
distances (SOD)

$$G_{\mathrm{SOD}}((\mathcal{X},d),(C_1,\ldots,C_k)) = \sum_{i=1}^{k} \sum_{x,y \in C_i} d(x,y)$$

and the MinCut objective that we shall discuss in Section 22.3 are not center-based
objectives.
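To make the objectives above concrete, here is a small Python sketch that evaluates them for a given clustering. It assumes that the candidate centers form a finite set (for example $\mathcal{X}' = \mathcal{X}$, as in k-medoids and k-median), so that the minimum can be found by brute force; the function names `center_based_cost` and `sum_of_in_cluster_distances` are ours. Because the clustering is fixed, the joint minimum over $\mu_1,\ldots,\mu_k$ decomposes into an independent minimum for each cluster.

```python
def center_based_cost(clusters, d, f, candidates):
    """G_f for a fixed clustering, with centers restricted to the finite set `candidates`."""
    total = 0.0
    for cluster in clusters:
        # Each cluster independently picks the candidate center of least cost.
        total += min(sum(f(d(x, mu)) for x in cluster) for mu in candidates)
    return total

def sum_of_in_cluster_distances(clusters, d):
    """G_SOD: sum of d(x, y) over all ordered pairs (x, y) in the same cluster."""
    return sum(d(x, y) for cluster in clusters for x in cluster for y in cluster)

# Toy data on the real line with d(x, y) = |x - y|.
points = [0.0, 1.0, 2.0, 10.0, 12.0]
clusters = [[0.0, 1.0, 2.0], [10.0, 12.0]]
d = lambda x, y: abs(x - y)

k_medoids_cost = center_based_cost(clusters, d, lambda t: t ** 2, points)
k_median_cost = center_based_cost(clusters, d, lambda t: t, points)
sod_cost = sum_of_in_cluster_distances(clusters, d)
```

Choosing $f(t) = t^2$ gives the k-medoids cost and $f(t) = t$ gives the k-median cost. The k-means cost in Euclidean space would instead allow any point of $\mathbb{R}^n$ as a center, which this brute-force sketch does not cover.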


22.2.1 The k-Means Algorithm


The k-means objective function is quite popular in practical applications of clus- 
tering. However, it turns out that finding the optimal k-means solution is often 
computationally infeasible (the problem is NP-hard, and even NP-hard to approx- 
imate to within some constant). As an alternative, the following simple iterative 
algorithm is often used, so often that, in many cases, the term k-means Clustering 
refers to the outcome of this algorithm rather than to the clustering that minimizes 
the k-means objective cost. We describe the algorithm with respect to the Euclidean
distance function $d(x, y) = \|x - y\|$.


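k-Means


input: $\mathcal{X} \subset \mathbb{R}^n$ ; Number of clusters $k$
initialize: Randomly choose initial centroids $\mu_1,\ldots,\mu_k$


repeat until convergence
    $\forall i \in [k]$ set $C_i = \{x \in \mathcal{X} : i = \operatorname{argmin}_j \|x - \mu_j\|\}$
    (break ties in some arbitrary manner)
    $\forall i \in [k]$ update $\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$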
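The iteration in the box above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: points are tuples of floats, the initial centroids are drawn at random from the data, ties go to the lowest cluster index, and an empty cluster keeps its previous centroid (a case the box does not address). The names `kmeans`, `squared_distance`, `mean`, `max_iters`, and `seed` are ours.

```python
import random

def squared_distance(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(points):
    """Coordinate-wise mean of a nonempty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iters=100, seed=0):
    """Run the k-means iteration; return (centroids, clusters, objective history)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: random data points
    history = []
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in points:
            i = min(range(k), key=lambda j: squared_distance(x, centroids[j]))
            clusters[i].append(x)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        # Objective of the current partition, measured against the updated centroids.
        history.append(sum(squared_distance(x, new_centroids[i])
                           for i, c in enumerate(clusters) for x in c))
        if new_centroids == centroids:
            break  # nothing moved, so the assignment will not change either
        centroids = new_centroids
    return centroids, clusters, history
```

The returned `history` records the k-means objective after every iteration, so one can inspect it to see the monotone behavior established in Lemma 22.1 (up to floating-point rounding).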


Lemma 22.1. Each iteration of the k-means algorithm does not increase the k-means 
objective function (as given in Equation (22.1)). 


Proof. To simplify the notation, let us use the shorthand $G(C_1,\ldots,C_k)$ for the
k-means objective, namely,

$$G(C_1,\ldots,C_k) = \min_{\mu_1,\ldots,\mu_k \in \mathbb{R}^n} \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2. \qquad (22.2)$$

It is convenient to define $\mu(C_i) = \frac{1}{|C_i|} \sum_{x \in C_i} x$ and note that
$\mu(C_i) = \operatorname{argmin}_{\mu \in \mathbb{R}^n} \sum_{x \in C_i} \|x - \mu\|^2$. Therefore, we can rewrite the k-means objective as

$$G(C_1,\ldots,C_k) = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu(C_i)\|^2. \qquad (22.3)$$

Consider the update at iteration $t$ of the k-means algorithm. Let $C_1^{(t-1)},\ldots,C_k^{(t-1)}$
be the previous partition, let $\mu_i^{(t-1)} = \mu(C_i^{(t-1)})$, and let $C_1^{(t)},\ldots,C_k^{(t)}$ be the new
partition assigned at iteration $t$. Using the definition of the objective as given in
Equation (22.2) we clearly have that

$$G(C_1^{(t)},\ldots,C_k^{(t)}) \le \sum_{i=1}^{k} \sum_{x \in C_i^{(t)}} \|x - \mu_i^{(t-1)}\|^2. \qquad (22.4)$$

In addition, the definition of the new partition $(C_1^{(t)},\ldots,C_k^{(t)})$ implies that it
minimizes the expression $\sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i^{(t-1)}\|^2$ over all possible partitions
$(C_1,\ldots,C_k)$. Hence,

$$\sum_{i=1}^{k} \sum_{x \in C_i^{(t)}} \|x - \mu_i^{(t-1)}\|^2 \le \sum_{i=1}^{k} \sum_{x \in C_i^{(t-1)}} \|x - \mu_i^{(t-1)}\|^2. \qquad (22.5)$$

Using Equation (22.3) we have that the right-hand side of Equation (22.5) equals
$G(C_1^{(t-1)},\ldots,C_k^{(t-1)})$. Combining this with Equation (22.4) and Equation (22.5), we
obtain that $G(C_1^{(t)},\ldots,C_k^{(t)}) \le G(C_1^{(t-1)},\ldots,C_k^{(t-1)})$, which concludes our proof. $\square$


While the preceding lemma tells us that the k-means objective is monotonically 
nonincreasing, there is no guarantee on the number of iterations the k-means algo- 
rithm needs in order to reach convergence. Furthermore, there is no nontrivial lower 
bound on the gap between the value of the k-means objective of the algorithm’s 
output and the minimum possible value of that objective function. In fact, k-means 
might converge to a point which is not even a local minimum (see Exercise 22.2). 
To improve the results of k-means it is often recommended to repeat the procedure 
several times with different randomly chosen initial centroids (e.g., we can choose 
the initial centroids to be random points from the data). 


22.3 SPECTRAL CLUSTERING


Often, a convenient way to represent the relationships between points in a data set 
$\mathcal{X} = \{x_1,\ldots,x_m\}$ is by a similarity graph; each vertex represents a data point $x_i$, and
every two vertices are connected by an edge whose weight is their similarity, $W_{i,j} =
s(x_i,x_j)$, where $W \in \mathbb{R}^{m,m}$. For example, we can set $W_{i,j} = \exp(-d(x_i,x_j)^2/\sigma^2)$,
where $d(\cdot,\cdot)$ is a distance function and $\sigma$ is a parameter. The clustering problem can
now be formulated as follows: We want to find a partition of the graph such that the 
edges between different groups have low weights and the edges within a group have 
high weights. 

In the clustering objectives described previously, the focus was on one side of 
our intuitive definition of clustering - making sure that points in the same cluster 
are similar. We now present objectives that focus on the other requirement — points 
separated into different clusters should be nonsimilar. 


22.3.1 Graph Cut


Given a graph represented by a similarity matrix $W$, the simplest and most direct
way to construct a partition of the graph is to solve the mincut problem, which
chooses a partition $C_1,\ldots,C_k$ that minimizes the objective

$$\mathrm{cut}(C_1,\ldots,C_k) = \sum_{i=1}^{k} \sum_{r \in C_i,\, s \notin C_i} W_{r,s}.$$


For k = 2, the mincut problem can be solved efficiently. However, in practice it 
often does not lead to satisfactory partitions. The problem is that in many cases, the 
solution of mincut simply separates one individual vertex from the rest of the graph.
Of course, this is not what we want to achieve in clustering, as clusters should be
reasonably large groups of points. 

Several solutions to this problem have been suggested. The simplest solution is 
to normalize the cut and define the normalized mincut objective as follows: 


$$\mathrm{RatioCut}(C_1,\ldots,C_k) = \sum_{i=1}^{k} \frac{1}{|C_i|} \sum_{r \in C_i,\, s \notin C_i} W_{r,s}.$$


The preceding objective assumes smaller values if the clusters are not too small. 
Unfortunately, introducing this balancing makes the problem computationally hard 
to solve. Spectral clustering is a way to relax the problem of minimizing RatioCut. 


22.3.2 Graph Laplacian and Relaxed Graph Cuts 


The main mathematical object for spectral clustering is the graph Laplacian matrix. 
There are several different definitions of graph Laplacian in the literature, and in 
the following we describe one particular definition. 


Definition 22.2 (Unnormalized Graph Laplacian). The unnormalized graph Laplacian
is the $m \times m$ matrix $L = D - W$ where $D$ is a diagonal matrix with
$D_{i,i} = \sum_{j=1}^{m} W_{i,j}$. The matrix $D$ is called the degree matrix.


The following lemma underscores the relation between RatioCut and the 
Laplacian matrix. 


Lemma 22.3. Let $C_1,\ldots,C_k$ be a clustering and let $H \in \mathbb{R}^{m,k}$ be the matrix such that

$$H_{i,j} = \frac{1}{\sqrt{|C_j|}}\, \mathbb{1}_{[i \in C_j]}.$$

Then, the columns of $H$ are orthonormal to each other and

$$\mathrm{RatioCut}(C_1,\ldots,C_k) = \mathrm{trace}(H^\top L H).$$


Proof. Let $h_1,\ldots,h_k$ be the columns of $H$. The fact that these vectors are orthonormal
is immediate from the definition. Next, by standard algebraic manipulations, it
can be shown that $\mathrm{trace}(H^\top L H) = \sum_{i=1}^{k} h_i^\top L h_i$ and that for any vector $v$ we have

$$v^\top L v = \frac{1}{2}\Big(\sum_{r} D_{r,r} v_r^2 - 2\sum_{r,s} v_r v_s W_{r,s} + \sum_{s} D_{s,s} v_s^2\Big) = \frac{1}{2}\sum_{r,s} W_{r,s}(v_r - v_s)^2.$$

Applying this with $v = h_i$ and noting that $(h_{i,r} - h_{i,s})^2$ is nonzero only if $r \in C_i, s \notin C_i$
or the other way around, we obtain that

$$h_i^\top L h_i = \frac{1}{|C_i|} \sum_{r \in C_i,\, s \notin C_i} W_{r,s}. \qquad \square$$
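The identity of Lemma 22.3 is easy to check numerically. The sketch below builds a Gaussian similarity matrix for five points on the real line, forms the unnormalized Laplacian, and computes both sides of the identity. The data, the value of $\sigma$, and the function names `laplacian`, `ratio_cut`, and `trace_HLH` are illustrative assumptions of ours. Clusters are given as lists of vertex indices.

```python
import math

def laplacian(W):
    """Unnormalized graph Laplacian L = D - W, with D[i][i] = sum_j W[i][j]."""
    m = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(m)]
            for i in range(m)]

def ratio_cut(W, clusters):
    """Sum over clusters of (total weight leaving the cluster) / (cluster size)."""
    m = len(W)
    total = 0.0
    for C in clusters:
        inside = set(C)
        cut = sum(W[r][s] for r in C for s in range(m) if s not in inside)
        total += cut / len(C)
    return total

def trace_HLH(W, clusters):
    """trace(H^T L H), where column j of H is the scaled indicator of cluster j."""
    m, L = len(W), laplacian(W)
    total = 0.0
    for C in clusters:
        h = [0.0] * m
        for i in C:
            h[i] = 1.0 / math.sqrt(len(C))
        total += sum(h[r] * L[r][s] * h[s] for r in range(m) for s in range(m))
    return total

# Toy similarity graph: W[i][j] = exp(-d(x_i, x_j)^2 / sigma^2) on the real line.
xs = [0.0, 1.0, 2.0, 10.0, 11.0]
sigma = 2.0
W = [[math.exp(-((a - b) ** 2) / sigma ** 2) for b in xs] for a in xs]
clusters = [[0, 1, 2], [3, 4]]
```

Evaluating `ratio_cut(W, clusters)` and `trace_HLH(W, clusters)` should give the same number up to floating-point error, for this and for any other partition of the five indices.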


Therefore, to minimize RatioCut we can search for a matrix $H$ whose columns
are orthonormal and such that each $H_{i,j}$ is either $0$ or $1/\sqrt{|C_j|}$. Unfortunately, this
is an integer programming problem which we cannot solve efficiently. Instead, we
relax the latter requirement and simply search for an orthonormal matrix $H \in \mathbb{R}^{m,k}$
that minimizes $\mathrm{trace}(H^\top L H)$. As we will see in the next chapter about PCA
(particularly, the proof of Theorem 23.2), the solution to this problem is to set $H$ to
be the matrix $U$ whose columns are the eigenvectors corresponding to the $k$
minimal eigenvalues of $L$. The resulting algorithm is called Unnormalized Spectral
Clustering.


22.3.3 Unnormalized Spectral Clustering
Unnormalized Spectral Clustering 


Input: $W \in \mathbb{R}^{m,m}$ ; Number of clusters $k$
Initialize: Compute the unnormalized graph Laplacian $L$
Let $U \in \mathbb{R}^{m,k}$ be the matrix whose columns are the eigenvectors of $L$
    corresponding to the $k$ smallest eigenvalues
Let $v_1,\ldots,v_m$ be the rows of $U$
Cluster the points $v_1,\ldots,v_m$ using k-means
Output: Clusters $C_1,\ldots,C_k$ of the k-means algorithm


The spectral clustering algorithm starts with finding the matrix H of the k eigen- 
vectors corresponding to the smallest eigenvalues of the graph Laplacian matrix. It 
then represents points according to the rows of H. It is due to the properties of the 
graph Laplacians that this change of representation is useful. In many situations, 
this change of representation enables the simple k-means algorithm to detect the 
clusters seamlessly. Intuitively, if H is as defined in Lemma 22.3 then each point in 
the new representation is an indicator vector whose value is nonzero only on the 
element corresponding to the cluster it belongs to. 


22.4 INFORMATION BOTTLENECK* 


The information bottleneck method is a clustering technique introduced by Tishby, 
Pereira, and Bialek. It relies on notions from information theory. To illustrate the 
method, consider the problem of clustering text documents where each document 
is represented as a bag-of-words; namely, each document is a vector $x \in \{0,1\}^n$,
where n is the size of the dictionary and x; = 1 iff the word corresponding to index 
i appears in the document. Given a set of m documents, we can interpret the bag- 
of-words representation of the m documents as a joint probability over a random 
variable x, indicating the identity of a document (thus taking values in [m]), and a 
random variable y, indicating the identity of a word in the dictionary (thus taking 
values in [n]). 

With this interpretation, the information bottleneck refers to the identity of a 
clustering as another random variable, denoted C, that takes values in [k] (where 
k will be set by the method as well). Once we have formulated x, y,C as random 
variables, we can use tools from information theory to express a clustering objective. 
In particular, the information bottleneck objective is 


$$\min_{p(C|x)} \; I(x;C) - \beta\, I(C;y),$$

where $I(\cdot;\cdot)$ is the mutual information between two random variables,¹ $\beta$ is a parameter,
and the minimization is over all possible probabilistic assignments of points to
clusters. Intuitively, we would like to achieve two contradictory goals. On one hand, 
we would like the mutual information between the identity of the document and 
the identity of the cluster to be as small as possible. This reflects the fact that we 
would like a strong compression of the original data. On the other hand, we would 
like high mutual information between the clustering variable and the identity of the 
words, which reflects the goal that the “relevant” information about the document 
(as reflected by the words that appear in the document) is retained. This generalizes 
the classical notion of minimal sufficient statistics² used in parametric statistics to
arbitrary distributions. 
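The trade-off in the information bottleneck objective can be illustrated on a toy corpus. The sketch below is only an evaluation of the objective for a given assignment, not an optimization method. It assumes that the joint distribution $p(x,y)$ is obtained by normalizing the document-word matrix, and that $C$ depends on $y$ only through $x$, so that $p(c,y) = \sum_x p(x,y)\,p(c|x)$. The function names and the toy data are ours.

```python
import math

def mutual_information(joint):
    """I(A;B) = sum_{a,b} p(a,b) log(p(a,b) / (p(a) p(b))) for a table joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log(p / (pa[a] * pb[b]))
        for a, row in enumerate(joint)
        for b, p in enumerate(row)
        if p > 0  # terms with p(a,b) = 0 contribute nothing
    )

def ib_objective(p_xy, p_c_given_x, beta):
    """I(x;C) - beta * I(C;y) for a joint p(x,y) and assignments p(C|x)."""
    m, n, k = len(p_xy), len(p_xy[0]), len(p_c_given_x[0])
    p_x = [sum(row) for row in p_xy]
    p_xc = [[p_x[x] * p_c_given_x[x][c] for c in range(k)] for x in range(m)]
    p_cy = [[sum(p_xy[x][y] * p_c_given_x[x][c] for x in range(m)) for y in range(n)]
            for c in range(k)]
    return mutual_information(p_xc) - beta * mutual_information(p_cy)

# Four documents over a four-word dictionary: documents 0 and 1 use words 0 and 1,
# documents 2 and 3 use words 2 and 3.
p_xy = [[1/8, 1/8, 0, 0],
        [1/8, 1/8, 0, 0],
        [0, 0, 1/8, 1/8],
        [0, 0, 1/8, 1/8]]
by_vocabulary = [[1, 0], [1, 0], [0, 1], [0, 1]]  # groups documents sharing words
mixed = [[1, 0], [0, 1], [1, 0], [0, 1]]          # splits each such group
```

Both hard assignments compress the documents equally (they have the same $I(x;C)$), but only `by_vocabulary` retains information about the words, so for any $\beta > 0$ it attains the lower objective value.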

Solving the optimization problem associated with the information bottleneck 
principle is hard in the general case. Some of the proposed methods are similar 
to the EM principle, which we will discuss in Chapter 24. 


22.5 A HIGH LEVEL VIEW OF CLUSTERING 


So far, we have mainly listed various useful clustering tools. However, some funda- 
mental questions remain unaddressed. First and foremost, what is clustering? What 
is it that distinguishes a clustering algorithm from any arbitrary function that takes 
an input space and outputs a partition of that space? Are there any basic properties 
of clustering that are independent of any specific algorithm or task? 

One method for addressing such questions is via an axiomatic approach. There 
have been several attempts to provide an axiomatic definition of clustering. Let us 
demonstrate this approach by presenting the attempt made by Kleinberg (2003). 

Consider a clustering function, $F$, that takes as input any finite domain $\mathcal{X}$ with a
dissimilarity function $d$ over its pairs and returns a partition of $\mathcal{X}$.

Consider the following three properties of such a function: 


Scale Invariance (SI) For any domain set $\mathcal{X}$, dissimilarity function $d$, and any
$\alpha > 0$, the following should hold: $F(\mathcal{X},d) = F(\mathcal{X},\alpha d)$ (where
$(\alpha d)(x,y) \stackrel{\mathrm{def}}{=} \alpha\, d(x,y)$).

Richness (Ri) For any finite $\mathcal{X}$ and every partition $C = (C_1,\ldots,C_k)$ of $\mathcal{X}$ (into
nonempty subsets) there exists some dissimilarity function $d$ over $\mathcal{X}$ such that
$F(\mathcal{X},d) = C$.

Consistency (Co) If $d$ and $d'$ are dissimilarity functions over $\mathcal{X}$, such that for every
$x,y \in \mathcal{X}$, if $x,y$ belong to the same cluster in $F(\mathcal{X},d)$ then $d'(x,y) \le d(x,y)$
and if $x,y$ belong to different clusters in $F(\mathcal{X},d)$ then $d'(x,y) \ge d(x,y)$, then
$F(\mathcal{X},d) = F(\mathcal{X},d')$.


¹ That is, given a probability function, $p$, over the pairs $(x,C)$, $I(x;C) = \sum_a \sum_b p(a,b) \log\left(\frac{p(a,b)}{p(a)p(b)}\right)$,
where the sum is over all values $x$ can take and all values $C$ can take.

² A sufficient statistic is a function of the data which has the property of sufficiency with respect to a
statistical model and its associated unknown parameter, meaning that “no other statistic which can be
calculated from the same sample provides any additional information as to the value of the parameter.”
For example, if we assume that a variable is distributed normally with a unit variance and an unknown
expectation, then the average function is a sufficient statistic.

A moment of reflection reveals that the Scale Invariance is a very natural
requirement — it would be odd to have the result of a clustering function depend 
on the units used to measure between-point distances. The Richness requirement 
basically states that the outcome of the clustering function is fully controlled by 
the function d, which is also a very intuitive feature. The third requirement, Con- 
sistency, is the only requirement that refers to the basic (informal) definition of 
clustering — we wish that similar points will be clustered together and that dissimilar 
points will be separated to different clusters, and therefore, if points that already 
share a cluster become more similar, and points that are already separated become 
even less similar to each other, the clustering function should have even stronger 
“support” of its previous clustering decisions. 

However, Kleinberg (2003) has shown the following “impossibility” result: 


Theorem 22.4. There exists no function, F, that satisfies all the three properties: Scale 
Invariance, Richness, and Consistency. 


Proof. Assume, by way of contradiction, that some $F$ does satisfy all three properties.
Pick some domain set $\mathcal{X}$ with at least three points. By Richness, there must be
some $d_1$ such that $F(\mathcal{X},d_1) = \{\{x\} : x \in \mathcal{X}\}$ and there also exists some $d_2$ such that
$F(\mathcal{X},d_2) \ne F(\mathcal{X},d_1)$.

Let $\alpha \in \mathbb{R}_+$ be such that for every $x,y \in \mathcal{X}$, $\alpha d_2(x,y) \ge d_1(x,y)$. Let $d_3 =
\alpha d_2$. Consider $F(\mathcal{X},d_3)$. By the Scale Invariance property of $F$, we should have
$F(\mathcal{X},d_3) = F(\mathcal{X},d_2)$. On the other hand, since all distinct $x,y \in \mathcal{X}$ reside in different
clusters w.r.t. $F(\mathcal{X},d_1)$, and $d_3(x,y) \ge d_1(x,y)$, the Consistency of $F$ implies
that $F(\mathcal{X},d_3) = F(\mathcal{X},d_1)$. This is a contradiction, since we chose $d_1,d_2$ so that
$F(\mathcal{X},d_2) \ne F(\mathcal{X},d_1)$. $\square$


It is important to note that there is no single “bad property” among the three
properties. For every pair of the three axioms,
there exist natural clustering functions that satisfy the two properties in that pair 
(one can even construct such examples just by varying the stopping criteria for the 
Single Linkage clustering function). On the other hand, Kleinberg shows that any 
clustering algorithm that minimizes any center-based objective function inevitably 
fails the consistency property (yet, the k-sum-of-in-cluster-distances minimization 
clustering does satisfy Consistency). 

The Kleinberg impossibility result can be easily circumvented by varying the 
properties. For example, if one wishes to discuss clustering functions that have 
a fixed number-of-clusters parameter, then it is natural to replace Richness by k- 
Richness (namely, the requirement that every partition of the domain into k subsets 
is attainable by the clustering function). k-Richness, Scale Invariance and Consis- 
tency all hold for the k-means clustering and are therefore consistent. Alternatively, 
one can relax the Consistency property. For example, say that two clusterings
$C = (C_1, \ldots, C_k)$ and $C' = (C'_1, \ldots, C'_l)$ are compatible if for
every two clusters $C_i \in C$ and $C'_j \in C'$, either $C_i \subseteq C'_j$ or
$C'_j \subseteq C_i$ or $C_i \cap C'_j = \emptyset$ (it is worthwhile noting that
for every dendrogram, every two clusterings that are obtained by trimming that
dendrogram are compatible). “Refinement Consistency” is the requirement that,
under the assumptions of the Consistency property, the new clustering
$F(\mathcal{X}, d')$ is compatible with the old clustering $F(\mathcal{X}, d)$.
Many common clustering functions satisfy this requirement as well as Scale
Invariance and Richness. Furthermore, one can come up with many other, different,
properties of clustering functions that sound intuitive and desirable and are
satisfied by some common clustering functions.
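The compatibility relation between two clusterings can be checked mechanically. The following is a minimal sketch, assuming clusterings are given as lists of Python sets; the function name `compatible` and the example clusterings are illustrative and do not come from the text.

```python
from itertools import product


def compatible(c1, c2):
    """Return True if every pair of clusters is nested or disjoint."""
    for a, b in product(c1, c2):
        a, b = set(a), set(b)
        if not (a <= b or b <= a or a.isdisjoint(b)):
            return False
    return True


# Hypothetical clusterings of the domain {1, ..., 5}.
coarse = [{1, 2, 3}, {4, 5}]
fine = [{1, 2}, {3}, {4, 5}]        # a refinement of `coarse`
crossing = [{1, 4}, {2, 3, 5}]      # cuts across the clusters of `coarse`

print(compatible(coarse, fine))
print(compatible(coarse, crossing))
```

Two trimmings of the same dendrogram behave like `coarse` and `fine` here: each cluster of one is either contained in, or disjoint from, each cluster of the other.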

There are many ways to interpret these results. We suggest viewing them as
indicating that there is no “ideal” clustering function. Every clustering
function will inevitably have some “undesirable” properties. The choice of a
clustering function for any given task must therefore take into account the
specific properties of that task. There is no generic clustering solution, just
as there is no classification algorithm that will learn every learnable task (as
the No-Free-Lunch theorem shows). Clustering, just like classification
prediction, must take into account some prior knowledge about the specific task
at hand.


22.6 SUMMARY 


Clustering is an unsupervised learning problem, in which we wish to partition a set 
of points into “meaningful” subsets. We presented several clustering approaches 
including linkage-based algorithms, the k-means family, spectral clustering, and 
the information bottleneck. We discussed the difficulty of formalizing the intuitive 
meaning of clustering. 


22.7 BIBLIOGRAPHIC REMARKS 


The k-means algorithm is sometimes named Lloyd’s algorithm, after Stuart Lloyd, 
who proposed the method in 1957. For a more complete overview of spectral clus- 
tering we refer the reader to the excellent tutorial by Von Luxburg (2007). The 
information bottleneck method was introduced by Tishby, Pereira, and Bialek 
(1999). For an additional discussion on the axiomatic approach see Ackerman and 
Ben-David (2008). 


22.8 EXERCISES 


22.1 Suboptimality of k-Means: For every parameter $t > 1$, show that there
exists an instance of the k-means problem for which the k-means algorithm
(might) find a solution whose k-means objective is at least $t \cdot \mathrm{OPT}$,
where OPT is the minimum k-means objective.

22.2 k-Means Might Not Necessarily Converge to a Local Minimum: Show that the
k-means algorithm might converge to a point which is not a local minimum. Hint:
Suppose that $k = 2$ and the sample points are $\{1,2,3,4\} \subset \mathbb{R}$;
suppose we initialize the k-means with the centers $\{2,4\}$; and suppose we
break ties in the definition of $C_i$ by assigning $i$ to be the smallest value
in $\operatorname{argmin}_j \|x - \mu_j\|$.

22.3 Given a metric space $(\mathcal{X}, d)$, where $|\mathcal{X}| < \infty$, and
$k \in \mathbb{N}$, we would like to find a partition of $\mathcal{X}$ into
$C_1, \ldots, C_k$ which minimizes the expression

$$G_{k\text{-diam}}((\mathcal{X}, d), (C_1, \ldots, C_k)) = \max_{j \in [k]} \operatorname{diam}(C_j),$$

where $\operatorname{diam}(C_j) = \max_{x,x' \in C_j} d(x,x')$ (we use the
convention $\operatorname{diam}(C_j) = 0$ if $|C_j| < 2$).

Similarly to the k-means objective, it is NP-hard to minimize the k-diam
objective. Fortunately, we have a very simple approximation algorithm: Initially,
we pick some $x \in \mathcal{X}$ and set $\mu_1 = x$. Then, the algorithm
iteratively sets

$$\forall j \in \{2, \ldots, k\}, \quad \mu_j = \operatorname*{argmax}_{x \in \mathcal{X}} \min_{i \in [j-1]} d(x, \mu_i).$$

Finally, we set

$$\forall i \in [k], \quad C_i = \{x \in \mathcal{X} : i = \operatorname*{argmin}_{j \in [k]} d(x, \mu_j)\}.$$

Prove that the algorithm described is a 2-approximation algorithm. That is, if we
denote its output by $\hat{C}_1, \ldots, \hat{C}_k$, and denote the optimal
solution by $C_1^*, \ldots, C_k^*$, then,

$$G_{k\text{-diam}}((\mathcal{X}, d), (\hat{C}_1, \ldots, \hat{C}_k)) \le 2 \cdot G_{k\text{-diam}}((\mathcal{X}, d), (C_1^*, \ldots, C_k^*)).$$

Hint: Consider the point $\mu_{k+1}$ (in other words, the next center we would
have chosen, if we wanted $k+1$ clusters). Let
$r = \min_{j \in [k]} d(\mu_j, \mu_{k+1})$. Prove the following inequalities

$$G_{k\text{-diam}}((\mathcal{X}, d), (\hat{C}_1, \ldots, \hat{C}_k)) \le 2r,$$
$$G_{k\text{-diam}}((\mathcal{X}, d), (C_1^*, \ldots, C_k^*)) \ge r.$$
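The greedy procedure of this exercise can be sketched in a few lines. This is only an illustration under stated assumptions: the input is a precomputed symmetric distance matrix `D`, the first center is the point with index `first`, and the names `k_diam_greedy` and `k_diam_objective` are hypothetical. It does not replace the proof the exercise asks for.

```python
import numpy as np


def k_diam_greedy(D, k, first=0):
    """Greedy center selection; D is an (m, m) symmetric distance matrix."""
    centers = [first]
    # Distance from every point to its closest chosen center so far.
    dist_to_centers = D[first].copy()
    for _ in range(1, k):
        nxt = int(np.argmax(dist_to_centers))   # farthest point from all centers
        centers.append(nxt)
        dist_to_centers = np.minimum(dist_to_centers, D[nxt])
    # Assign each point to its nearest center (ties go to the smallest index).
    labels = np.argmin(D[:, centers], axis=1)
    return centers, labels


def k_diam_objective(D, labels, k):
    """Largest within-cluster diameter; a singleton cluster has diameter 0."""
    return max(
        (D[np.ix_(labels == j, labels == j)].max()
         for j in range(k) if np.any(labels == j)),
        default=0.0,
    )


# Six points on the real line forming two well-separated groups.
points = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
D = np.abs(points[:, None] - points[None, :])
centers, labels = k_diam_greedy(D, k=2)
cost = k_diam_objective(D, labels, k=2)
print(centers, labels, cost)
```

On this toy input the greedy solution happens to be optimal; in general the exercise only guarantees a factor of 2.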


22.4 Recall that a clustering function, $F$, is called Center-Based Clustering
if, for some monotonic function $f : \mathbb{R}_+ \to \mathbb{R}_+$, on every
given input $(\mathcal{X}, d)$, $F(\mathcal{X}, d)$ is a clustering that
minimizes the objective

$$G_f((\mathcal{X}, d), (C_1, \ldots, C_k)) = \min_{\mu_1, \ldots, \mu_k \in \mathcal{X}'} \sum_{i=1}^{k} \sum_{x \in C_i} f(d(x, \mu_i)),$$

where $\mathcal{X}'$ is either $\mathcal{X}$ or some superset of $\mathcal{X}$.
Prove that for every $k > 1$ the k-diam clustering function defined in the
previous exercise is not a center-based clustering function.
Hint: Given a clustering input $(\mathcal{X}, d)$, with $|\mathcal{X}| > 2$,
consider the effect of adding many close-by points to some (but not all) of the
members of $\mathcal{X}$, on either the k-diam clustering or any given
center-based clustering.
22.5 Recall that we discussed three clustering “properties”: Scale Invariance,
Richness, and Consistency. Consider the Single Linkage clustering algorithm.
1. Find which of the three properties is satisfied by Single Linkage with the
   Fixed Number of Clusters (any fixed nonzero number) stopping rule.
2. Find which of the three properties is satisfied by Single Linkage with the
   Distance Upper Bound (any fixed nonzero upper bound) stopping rule.
3. Show that for any pair of these properties there exists a stopping criterion
   for Single Linkage clustering, under which these two properties are satisfied.

22.6 Given some number $k$, let k-Richness be the following requirement:
For any finite $\mathcal{X}$ and every partition $C = (C_1, \ldots, C_k)$ of
$\mathcal{X}$ (into nonempty subsets) there exists some dissimilarity function
$d$ over $\mathcal{X}$ such that $F(\mathcal{X}, d) = C$.
Prove that, for every number $k$, there exists a clustering function that
satisfies the three properties: Scale Invariance, k-Richness, and Consistency.


Dimensionality reduction is the process of taking data in a high dimensional space
and mapping it into a new space whose dimensionality is much smaller. This process 
is closely related to the concept of (lossy) compression in information theory. There 
are several reasons to reduce the dimensionality of the data. First, high dimensional 
data impose computational challenges. Moreover, in some situations high dimen- 
sionality might lead to poor generalization abilities of the learning algorithm (for 
example, in Nearest Neighbor classifiers the sample complexity increases exponen- 
tially with the dimension—see Chapter 19). Finally, dimensionality reduction can 
be used for interpretability of the data, for finding meaningful structure of the data, 
and for illustration purposes. 

In this chapter we describe popular methods for dimensionality reduction. In
those methods, the reduction is performed by applying a linear transformation to
the original data. That is, if the original data is in $\mathbb{R}^d$ and we want
to embed it into $\mathbb{R}^n$ ($n < d$) then we would like to find a matrix
$W \in \mathbb{R}^{n,d}$ that induces the mapping $x \mapsto Wx$. A natural
criterion for choosing $W$ is in a way that will enable a reasonable recovery of
the original $x$. It is not hard to show that in general, exact recovery of $x$
from $Wx$ is impossible (see Exercise 23.1).

The first method we describe is called Principal Component Analysis (PCA). 
In PCA, both the compression and the recovery are performed by linear transfor- 
mations and the method finds the linear transformations for which the differences 
between the recovered vectors and the original vectors are minimal in the least 
squared sense. 

Next, we describe dimensionality reduction using random matrices W. We 
derive an important lemma, often called the “Johnson-Lindenstrauss lemma,” 
which analyzes the distortion caused by such a random dimensionality reduction 
technique. 

Last, we show how one can reduce the dimension of all sparse vectors, again
using a random matrix. This process is known as Compressed Sensing. In this case,
the recovery process is nonlinear but can still be implemented efficiently using
linear programming.

We conclude by underscoring the underlying “prior assumptions” behind PCA and
compressed sensing, which can help us understand the merits and pitfalls of the
two methods.


23.1 PRINCIPAL COMPONENT ANALYSIS (PCA)


Let $x_1, \ldots, x_m$ be $m$ vectors in $\mathbb{R}^d$. We would like to reduce
the dimensionality of these vectors using a linear transformation. A matrix
$W \in \mathbb{R}^{n,d}$, where $n < d$, induces a mapping $x \mapsto Wx$, where
$Wx \in \mathbb{R}^n$ is the lower dimensionality representation of $x$. Then, a
second matrix $U \in \mathbb{R}^{d,n}$ can be used to (approximately) recover
each original vector $x$ from its compressed version. That is, for a compressed
vector $y = Wx$, where $y$ is in the low dimensional space $\mathbb{R}^n$, we can
construct $\tilde{x} = Uy$, so that $\tilde{x}$ is the recovered version of $x$
and resides in the original high dimensional space $\mathbb{R}^d$.

In PCA, we find the compression matrix $W$ and the recovering matrix $U$ so that
the total squared distance between the original and recovered vectors is minimal;
namely, we aim at solving the problem

$$\operatorname*{argmin}_{W \in \mathbb{R}^{n,d},\, U \in \mathbb{R}^{d,n}} \sum_{i=1}^{m} \|x_i - UWx_i\|_2^2. \tag{23.1}$$


To solve this problem we first show that the optimal solution takes a specific 
form. 


Lemma 23.1. Let $(U, W)$ be a solution to Equation (23.1). Then the columns of
$U$ are orthonormal (namely, $U^\top U$ is the identity matrix of $\mathbb{R}^n$)
and $W = U^\top$.


Proof. Fix any $U, W$ and consider the mapping $x \mapsto UWx$. The range of this
mapping, $R = \{UWx : x \in \mathbb{R}^d\}$, is an $n$ dimensional linear
subspace of $\mathbb{R}^d$. Let $V \in \mathbb{R}^{d,n}$ be a matrix whose
columns form an orthonormal basis of this subspace, namely, the range of $V$ is
$R$ and $V^\top V = I$. Therefore, each vector in $R$ can be written as $Vy$
where $y \in \mathbb{R}^n$. For every $x \in \mathbb{R}^d$ and
$y \in \mathbb{R}^n$ we have

$$\|x - Vy\|_2^2 = \|x\|^2 + y^\top V^\top V y - 2 y^\top V^\top x = \|x\|^2 + \|y\|^2 - 2 y^\top (V^\top x),$$

where we used the fact that $V^\top V$ is the identity matrix of $\mathbb{R}^n$.
Minimizing the preceding expression with respect to $y$ by comparing the gradient
with respect to $y$ to zero gives that $y = V^\top x$. Therefore, for each $x$ we
have that

$$VV^\top x = \operatorname*{argmin}_{\tilde{x} \in R} \|x - \tilde{x}\|_2^2.$$

In particular this holds for $x_1, \ldots, x_m$ and therefore we can replace
$U, W$ by $V, V^\top$ and by that do not increase the objective

$$\sum_{i=1}^{m} \|x_i - UWx_i\|_2^2 \ge \sum_{i=1}^{m} \|x_i - VV^\top x_i\|_2^2.$$

Since this holds for every $U, W$ the proof of the lemma follows. □

On the basis of the preceding lemma, we can rewrite the optimization problem
given in Equation (23.1) as follows:

$$\operatorname*{argmin}_{U \in \mathbb{R}^{d,n} :\, U^\top U = I} \sum_{i=1}^{m} \|x_i - UU^\top x_i\|_2^2. \tag{23.2}$$


We further simplify the optimization problem by using the following elementary
algebraic manipulations. For every $x \in \mathbb{R}^d$ and a matrix
$U \in \mathbb{R}^{d,n}$ such that $U^\top U = I$ we have

$$\begin{aligned}
\|x - UU^\top x\|^2 &= \|x\|^2 - 2x^\top UU^\top x + x^\top UU^\top UU^\top x \\
&= \|x\|^2 - x^\top UU^\top x \\
&= \|x\|^2 - \operatorname{trace}(U^\top x x^\top U),
\end{aligned} \tag{23.3}$$

where the trace of a matrix is the sum of its diagonal entries. Since the trace
is a linear operator, this allows us to rewrite Equation (23.2) as follows:

$$\operatorname*{argmax}_{U \in \mathbb{R}^{d,n} :\, U^\top U = I} \operatorname{trace}\left(U^\top \sum_{i=1}^{m} x_i x_i^\top U\right). \tag{23.4}$$

Let $A = \sum_{i=1}^{m} x_i x_i^\top$. The matrix $A$ is symmetric and therefore
it can be written using its spectral decomposition as $A = VDV^\top$, where $D$
is diagonal and $V^\top V = VV^\top = I$. Here, the elements on the diagonal of
$D$ are the eigenvalues of $A$ and the columns of $V$ are the corresponding
eigenvectors. We assume without loss of generality that
$D_{1,1} \ge D_{2,2} \ge \cdots \ge D_{d,d}$. Since $A$ is positive semidefinite
it also holds that $D_{d,d} \ge 0$. We claim that the solution to Equation (23.4)
is the matrix $U$ whose columns are the $n$ eigenvectors of $A$ corresponding to
the largest $n$ eigenvalues.


Theorem 23.2. Let $x_1, \ldots, x_m$ be arbitrary vectors in $\mathbb{R}^d$, let
$A = \sum_{i=1}^{m} x_i x_i^\top$, and let $u_1, \ldots, u_n$ be $n$ eigenvectors
of the matrix $A$ corresponding to the largest $n$ eigenvalues of $A$. Then, the
solution to the PCA optimization problem given in Equation (23.1) is to set $U$
to be the matrix whose columns are $u_1, \ldots, u_n$ and to set $W = U^\top$.


Proof. Let $VDV^\top$ be the spectral decomposition of $A$. Fix some matrix
$U \in \mathbb{R}^{d,n}$ with orthonormal columns and let $B = V^\top U$. Then,
$VB = VV^\top U = U$. It follows that

$$U^\top A U = B^\top V^\top V D V^\top V B = B^\top D B,$$

and therefore

$$\operatorname{trace}(U^\top A U) = \sum_{j=1}^{d} D_{j,j} \sum_{i=1}^{n} B_{j,i}^2.$$

Note that $B^\top B = U^\top VV^\top U = U^\top U = I$. Therefore, the columns of
$B$ are also orthonormal, which implies that
$\sum_{j=1}^{d} \sum_{i=1}^{n} B_{j,i}^2 = n$. In addition, let
$\tilde{B} \in \mathbb{R}^{d,d}$ be a matrix such that its first $n$ columns are
the columns of $B$ and in addition $\tilde{B}^\top \tilde{B} = I$. Then, for
every $j$ we have $\sum_{i=1}^{d} \tilde{B}_{j,i}^2 = 1$, which implies that
$\sum_{i=1}^{n} B_{j,i}^2 \le 1$. It follows that

$$\operatorname{trace}(U^\top A U) \le \max_{\beta \in [0,1]^d :\, \|\beta\|_1 \le n} \sum_{j=1}^{d} D_{j,j} \beta_j.$$

It is not hard to verify (see Exercise 23.2) that the right-hand side equals
$\sum_{j=1}^{n} D_{j,j}$. We have therefore shown that for every matrix
$U \in \mathbb{R}^{d,n}$ with orthonormal columns it holds that
$\operatorname{trace}(U^\top A U) \le \sum_{j=1}^{n} D_{j,j}$. On the other hand,
if we set $U$ to be the matrix whose columns are the $n$ leading eigenvectors of
$A$ we obtain that $\operatorname{trace}(U^\top A U) = \sum_{j=1}^{n} D_{j,j}$,
and this concludes our proof. □


Remark 23.1. The proof of Theorem 23.2 also tells us that the value of the
objective of Equation (23.4) is $\sum_{i=1}^{n} D_{i,i}$. Combining this with
Equation (23.3) and noting that
$\sum_{i=1}^{m} \|x_i\|^2 = \operatorname{trace}(A) = \sum_{i=1}^{d} D_{i,i}$ we
obtain that the optimal objective value of Equation (23.1) is
$\sum_{i=n+1}^{d} D_{i,i}$.


Remark 23.2. It is a common practice to “center” the examples before applying
PCA. That is, we first calculate $\mu = \frac{1}{m} \sum_{i=1}^{m} x_i$ and then
apply PCA on the vectors $(x_1 - \mu), \ldots, (x_m - \mu)$. This is also related
to the interpretation of PCA as variance maximization (see Exercise 23.4).
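Theorem 23.2 and Remark 23.1 can be checked numerically. The sketch below is only an illustration, assuming NumPy and synthetic Gaussian data; the function name `pca_eig` and the sizes are arbitrary choices, not part of the text. It builds $U$ from the leading eigenvectors of $A$ and compares the reconstruction error with the sum of the discarded eigenvalues.

```python
import numpy as np


def pca_eig(X, n):
    """X has the examples as rows. Returns U (d x n) and the eigenvalues of A,
    sorted in decreasing order."""
    A = X.T @ X                               # A = sum_i x_i x_i^T
    eigvals, eigvecs = np.linalg.eigh(A)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:n]], eigvals[order]


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6)) @ rng.standard_normal((6, 6))  # synthetic data
n = 2
U, D = pca_eig(X, n)

recon = X @ U @ U.T                   # row i is (U U^T x_i)^T
objective = np.sum((X - recon) ** 2)  # objective of Equation (23.1) with W = U^T
tail = D[n:].sum()                    # sum of the d - n smallest eigenvalues
print(objective, tail)
```

The two printed numbers should agree up to floating-point error, as Remark 23.1 predicts.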


23.1.1 A More Efficient Solution for the Case d >> m 


In some situations the original dimensionality of the data is much larger than
the number of examples $m$. The computational complexity of calculating the PCA
solution as described previously is $O(d^3)$ (for calculating eigenvalues of $A$)
plus $O(md^2)$ (for constructing the matrix $A$). We now show a simple trick that
enables us to calculate the PCA solution more efficiently when $d \gg m$.

Recall that the matrix $A$ is defined to be $\sum_{i=1}^{m} x_i x_i^\top$. It is
convenient to rewrite $A = X^\top X$ where $X \in \mathbb{R}^{m,d}$ is a matrix
whose $i$th row is $x_i^\top$. Consider the matrix $B = XX^\top$. That is,
$B \in \mathbb{R}^{m,m}$ is the matrix whose $i,j$ element equals
$\langle x_i, x_j \rangle$. Suppose that $u$ is an eigenvector of $B$: That is,
$Bu = \lambda u$ for some $\lambda \in \mathbb{R}$. Multiplying the equality by
$X^\top$ and using the definition of $B$ we obtain
$X^\top XX^\top u = \lambda X^\top u$. But, using the definition of $A$, we get
that $A(X^\top u) = \lambda (X^\top u)$. Thus,
$\frac{X^\top u}{\|X^\top u\|}$ is an eigenvector of $A$ with eigenvalue of
$\lambda$.

We can therefore calculate the PCA solution by calculating the eigenvalues of
$B$ instead of $A$. The complexity is $O(m^3)$ (for calculating eigenvalues of
$B$) and $m^2 d$ (for constructing the matrix $B$).


Remark 23.3. The previous discussion also implies that to calculate the PCA solu- 
tion we only need to know how to calculate inner products between vectors. This 
enables us to calculate PCA implicitly even when d is very large (or even infinite) 
using kernels, which yields the kernel PCA algorithm. 
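The relation between the eigenvectors of $B = XX^\top$ and those of $A = X^\top X$ can be verified on a small random instance. This is a minimal sketch, assuming NumPy and arbitrary sizes with $d \gg m$; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 40                        # few examples, many dimensions
X = rng.standard_normal((m, d))     # rows are the examples

B = X @ X.T                         # m x m matrix of inner products
lam, vecs = np.linalg.eigh(B)       # ascending eigenvalues
u, lam_max = vecs[:, -1], lam[-1]   # leading eigenpair of B

v = X.T @ u
v /= np.linalg.norm(v)              # candidate eigenvector of A

A = X.T @ X                         # d x d; built here only to check the claim
residual = np.linalg.norm(A @ v - lam_max * v)
print(residual)
```

The residual should be at the level of floating-point round-off, confirming that $X^\top u / \|X^\top u\|$ is an eigenvector of $A$ with the same eigenvalue.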


23.1.2 Implementation and Demonstration 


A pseudocode of PCA is given in the following.

PCA

input
    A matrix of m examples $X \in \mathbb{R}^{m,d}$
    number of components n
if (m > d)
    $A = X^\top X$
    Let $u_1, \ldots, u_n$ be the eigenvectors of A with largest eigenvalues
else
    $B = XX^\top$
    Let $v_1, \ldots, v_n$ be the eigenvectors of B with largest eigenvalues
    for $i = 1, \ldots, n$ set $u_i = \frac{1}{\|X^\top v_i\|} X^\top v_i$
output: $u_1, \ldots, u_n$
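The pseudocode translates almost line by line into NumPy. The following is a sketch rather than a reference implementation: it assumes `numpy.linalg.eigh` as the eigensolver, does not center the data, and in the `else` branch assumes that the $n$ leading eigenvalues of $B$ are strictly positive so that the normalization is well defined.

```python
import numpy as np


def pca(X, n):
    """Return a d x n matrix whose columns are the n leading principal directions.
    X is an m x d matrix with the examples as rows."""
    m, d = X.shape
    if m > d:
        A = X.T @ X
        _, vecs = np.linalg.eigh(A)          # ascending eigenvalues
        U = vecs[:, ::-1][:, :n]             # keep the n largest
    else:
        B = X @ X.T
        _, vecs = np.linalg.eigh(B)
        V = vecs[:, ::-1][:, :n]
        U = X.T @ V                          # columns X^T v_i
        U /= np.linalg.norm(U, axis=0, keepdims=True)   # normalize each column
    return U


rng = np.random.default_rng(1)
X_wide = rng.standard_normal((4, 10))        # m <= d: uses the B branch
U_wide = pca(X_wide, 2)

# Compare with the eigenvectors of A computed directly.
_, vecs_A = np.linalg.eigh(X_wide.T @ X_wide)
U_direct = vecs_A[:, ::-1][:, :2]
same_subspace = np.allclose(U_wide @ U_wide.T, U_direct @ U_direct.T)

X_tall = rng.standard_normal((50, 3))        # m > d: uses the A branch
U_tall = pca(X_tall, 2)
print(same_subspace, U_tall.shape)
```

The comparison is done on the projection matrices $UU^\top$ because eigenvectors are only determined up to sign.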


To illustrate how PCA works, let us generate vectors in $\mathbb{R}^2$ that
approximately reside on a line, namely, on a one dimensional subspace of
$\mathbb{R}^2$. For example, suppose that each example is of the form
$(x, x + y)$ where $x$ is chosen uniformly at random from $[-1, 1]$ and $y$ is
sampled from a Gaussian distribution with mean 0 and standard deviation of 0.1.
Suppose we apply PCA on this data. Then, the eigenvector corresponding to the
largest eigenvalue will be close to the vector $(1/\sqrt{2}, 1/\sqrt{2})$. When
projecting a point $(x, x + y)$ on this principal component we will obtain the
scalar $\frac{2x + y}{\sqrt{2}}$. The reconstruction of the original vector will
be $((x + y/2), (x + y/2))$. In Figure 23.1 we depict the original versus
reconstructed data.
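The two-dimensional illustration is easy to reproduce. The sketch below assumes NumPy, a sample size of 2000, and a fixed random seed, none of which are specified in the text; as in the text, the data are not centered.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2000                                   # assumed sample size
x = rng.uniform(-1.0, 1.0, size=m)
y = rng.normal(0.0, 0.1, size=m)
X = np.column_stack([x, x + y])            # examples of the form (x, x + y)

eigvals, eigvecs = np.linalg.eigh(X.T @ X)
u = eigvecs[:, -1]                         # eigenvector of the largest eigenvalue
u = u if u[0] > 0 else -u                  # fix the arbitrary sign

scores = X @ u                             # one-dimensional representation
recon = np.outer(scores, u)                # reconstruction in R^2
approx = np.column_stack([x + y / 2, x + y / 2])   # the closed form from the text
print(u, np.abs(recon - approx).max())
```

The leading eigenvector should come out close to $(1/\sqrt{2}, 1/\sqrt{2})$, and the reconstruction close to $(x + y/2, x + y/2)$.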


Figure 23.1. A set of vectors in $\mathbb{R}^2$ (x’s) and their reconstruction
after dimensionality reduction to $\mathbb{R}^1$ using PCA (circles).


Figure 23.2. Images of faces extracted from the Yale data set. Top-left: the
original images in $\mathbb{R}^{50 \times 50}$. Top-right: the images after
dimensionality reduction to $\mathbb{R}^{10}$ and reconstruction. Middle row: an
enlarged version of one of the images before and after PCA. Bottom: the images
after dimensionality reduction to $\mathbb{R}^2$. The different marks indicate
different individuals.


Next, we demonstrate the effectiveness of PCA on a data set of faces. We
extracted images of faces from the Yale data set (Georghiades, Belhumeur &
Kriegman 2001). Each image contains $50 \times 50 = 2500$ pixels; therefore the
original dimensionality is very high.

Some images of faces are depicted on the top-left side of Figure 23.2. Using PCA,
we reduced the dimensionality to $\mathbb{R}^{10}$ and reconstructed back to the
original dimension, which is $50^2$. The resulting reconstructed images are
depicted on the top-right side of Figure 23.2. Finally, on the bottom of
Figure 23.2 we depict a 2 dimensional representation of the images. As can be
seen, even from a 2 dimensional representation of the images we can still roughly
separate different individuals.


23.2 RANDOM PROJECTIONS 


In this section we show that reducing the dimension by using a random linear
transformation leads to a simple compression scheme with a surprisingly low
distortion. The transformation $x \mapsto Wx$, when $W$ is a random matrix, is
often referred to as a random projection. In particular, we provide a variant of
a famous lemma due to Johnson and Lindenstrauss, showing that random projections
do not distort Euclidean distances too much.

Let $x_1, x_2$ be two vectors in $\mathbb{R}^d$. A matrix $W$ does not distort
too much the distance between $x_1$ and $x_2$ if the ratio

$$\frac{\|Wx_1 - Wx_2\|}{\|x_1 - x_2\|}$$

is close to 1. In other words, the distances between $x_1$ and $x_2$ before and
after the transformation are almost the same. To show that $\|Wx_1 - Wx_2\|$ is
not too far away from $\|x_1 - x_2\|$ it suffices to show that $W$ does not
distort the norm of the difference vector $x = x_1 - x_2$. Therefore, from now on
we focus on the ratio $\frac{\|Wx\|}{\|x\|}$.

We start with analyzing the distortion caused by applying a random projection to
a single vector.


Lemma 23.3. Fix some $x \in \mathbb{R}^d$. Let $W \in \mathbb{R}^{n,d}$ be a
random matrix such that each $W_{i,j}$ is an independent normal random variable.
Then, for every $\epsilon \in (0,3)$ we have

$$\mathbb{P}\left[\left|\frac{\|(1/\sqrt{n})\,Wx\|^2}{\|x\|^2} - 1\right| > \epsilon\right] \le 2e^{-\epsilon^2 n/6}.$$

Proof. Without loss of generality we can assume that $\|x\|^2 = 1$. Therefore, an
equivalent inequality is

$$\mathbb{P}\left[(1-\epsilon)n \le \|Wx\|^2 \le (1+\epsilon)n\right] \ge 1 - 2e^{-\epsilon^2 n/6}.$$

Let $w_i$ be the $i$th row of $W$. The random variable $\langle w_i, x \rangle$
is a weighted sum of $d$ independent normal random variables and therefore it is
normally distributed with zero mean and variance $\sum_j x_j^2 = \|x\|^2 = 1$.
Therefore, the random variable
$\|Wx\|^2 = \sum_{i=1}^{n} (\langle w_i, x \rangle)^2$ has a $\chi^2_n$
distribution. The claim now follows directly from a measure concentration
property of $\chi^2$ random variables stated in Lemma B.12 given in Section B.7. □


The Johnson-Lindenstrauss lemma follows from this using a simple union bound 
argument. 


Lemma 23.4 (Johnson-Lindenstrauss Lemma). Let $Q$ be a finite set of vectors in
$\mathbb{R}^d$. Let $\delta \in (0,1)$ and $n$ be an integer such that

$$\epsilon = \sqrt{\frac{6 \log(2|Q|/\delta)}{n}} \le 3.$$

Then, with probability of at least $1 - \delta$ over a choice of a random matrix
$W \in \mathbb{R}^{n,d}$ such that each element of $W$ is distributed normally
with zero mean and variance of $1/n$ we have

$$\sup_{x \in Q} \left|\frac{\|Wx\|^2}{\|x\|^2} - 1\right| < \epsilon.$$

Proof. Combining Lemma 23.3 and the union bound we have that for every
$\epsilon \in (0,3)$:

$$\mathbb{P}\left[\sup_{x \in Q} \left|\frac{\|Wx\|^2}{\|x\|^2} - 1\right| > \epsilon\right] \le 2|Q|\,e^{-\epsilon^2 n/6}.$$

Let $\delta$ denote the right-hand side of the inequality; thus we obtain that

$$\epsilon = \sqrt{\frac{6 \log(2|Q|/\delta)}{n}}.$$

□


Interestingly, the bound given in Lemma 23.4 does not depend on the original
dimension of $x$. In fact, the bound holds even if $x$ is in an infinite
dimensional Hilbert space.
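The Johnson-Lindenstrauss bound can be compared with what a random matrix does in practice. The following is a small experiment, assuming NumPy and arbitrarily chosen values of $d$, $n$, $|Q|$, and $\delta$; it draws one Gaussian matrix with entries of variance $1/n$ and measures the worst distortion of the squared norm over a random set $Q$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, delta = 2000, 1000, 0.1          # assumed sizes for the experiment
Q = rng.standard_normal((50, d))       # a finite set of 50 vectors (rows)

# Value of epsilon given by the lemma for this n, |Q| and delta.
eps = np.sqrt(6 * np.log(2 * len(Q) / delta) / n)

# Entries are N(0, 1/n), i.e. standard deviation sqrt(1/n).
W = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, d))

ratios = np.sum((Q @ W.T) ** 2, axis=1) / np.sum(Q ** 2, axis=1)
max_distortion = np.max(np.abs(ratios - 1.0))
print(eps, max_distortion)
```

The lemma guarantees that the observed distortion is below `eps` with probability at least $1 - \delta$; in runs like this one the observed value is typically well below the bound, which is not tight.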


23.3 COMPRESSED SENSING 


Compressed sensing is a dimensionality reduction technique which utilizes a prior
assumption that the original vector is sparse in some basis. To motivate
compressed sensing, consider a vector $x \in \mathbb{R}^d$ that has at most $s$
nonzero elements. That is,

$$\|x\|_0 \stackrel{\text{def}}{=} |\{i : x_i \neq 0\}| \le s.$$

Clearly, we can compress $x$ by representing it using $s$ (index,value) pairs.
Furthermore, this compression is lossless: we can reconstruct $x$ exactly from
the $s$ (index,value) pairs. Now, let us take one step forward and assume that
$x = U\alpha$, where $\alpha$ is a sparse vector, $\|\alpha\|_0 \le s$, and $U$
is a fixed orthonormal matrix. That is, $x$ has a sparse representation in
another basis. It turns out that many natural vectors are (at least
approximately) sparse in some representation. In fact, this assumption underlies
many modern compression schemes. For example, the JPEG-2000 format for image
compression relies on the fact that natural images are approximately sparse in a
wavelet basis.

Can we still compress $x$ into roughly $s$ numbers? Well, one simple way to do
this is to multiply $x$ by $U^\top$, which yields the sparse vector $\alpha$, and
then represent $\alpha$ by its $s$ (index,value) pairs. However, this requires us
first to “sense” $x$, to store it, and then to multiply it by $U^\top$. This
raises a very natural question: Why go to so much effort to acquire all the data
when most of what we get will be thrown away? Can we not just directly measure
the part that will not end up being thrown away?

Compressed sensing is a technique that simultaneously acquires and compresses
the data. The key result is that a random linear transformation can compress $x$
without losing information. The number of measurements needed is on the order of
$s \log(d)$. That is, we roughly acquire only the important information about the
signal. As we will see later, the price we pay is a slower reconstruction phase.
In some situations, it makes sense to save time in compression even at the price
of a slower reconstruction. For example, a security camera should sense and
compress a large number of images while most of the time we do not need to decode
the compressed data at all. Furthermore, in many practical applications,
compression by a linear transformation is advantageous because it can be
performed efficiently in hardware. For example, a team led by Baraniuk and Kelly
has proposed a camera architecture that employs a digital micromirror array to
perform optical calculations of a linear transformation of an image. In this
case, obtaining each compressed measurement is as easy as obtaining a single raw
measurement. Another important application of compressed sensing is medical
imaging, in which requiring fewer measurements translates to less radiation for
the patient.

Informally, the main premise of compressed sensing is the following three 
“surprising” results: 


1. It is possible to reconstruct any sparse signal fully if it was compressed by
   $x \mapsto Wx$, where $W$ is a matrix which satisfies a condition called the
   Restricted Isometry Property (RIP). A matrix that satisfies this property is
   guaranteed to have a low distortion of the norm of any sparse representable
   vector.

2. The reconstruction can be calculated in polynomial time by solving a linear
   program.

3. A random $n \times d$ matrix is likely to satisfy the RIP condition provided
   that $n$ is greater than an order of $s \log(d)$.


Formally, 


Definition 23.5 (RIP). A matrix $W \in \mathbb{R}^{n,d}$ is $(\epsilon, s)$-RIP
if for all $x \neq 0$ s.t. $\|x\|_0 \le s$ we have

$$\left|\frac{\|Wx\|_2^2}{\|x\|_2^2} - 1\right| \le \epsilon.$$


The first theorem establishes that RIP matrices yield a lossless compression
scheme for sparse vectors. It also provides a (nonefficient) reconstruction
scheme.


Theorem 23.6. Let $\epsilon < 1$ and let $W$ be a $(\epsilon, 2s)$-RIP matrix.
Let $x$ be a vector s.t. $\|x\|_0 \le s$, let $y = Wx$ be the compression of $x$,
and let

$$\tilde{x} \in \operatorname*{argmin}_{v :\, Wv = y} \|v\|_0$$

be a reconstructed vector. Then, $\tilde{x} = x$.


Proof. We assume, by way of contradiction, that $\tilde{x} \neq x$. Since $x$
satisfies the constraints in the optimization problem that defines $\tilde{x}$ we
clearly have that $\|\tilde{x}\|_0 \le \|x\|_0 \le s$. Therefore,
$\|x - \tilde{x}\|_0 \le 2s$ and we can apply the RIP inequality on the vector
$x - \tilde{x}$. But, since $W(x - \tilde{x}) = 0$ we get that
$|0 - 1| \le \epsilon$, which leads to a contradiction. □
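The reconstruction scheme of Theorem 23.6 can be run by brute force on tiny instances, which also makes its combinatorial cost visible. The sketch below assumes NumPy; the function `l0_reconstruct`, the tolerance, and the problem sizes are illustrative choices. It does not verify that $W$ is RIP; it relies on the fact that a random Gaussian matrix of this shape almost surely has every $2s$ columns linearly independent, which is enough for uniqueness here.

```python
import numpy as np
from itertools import combinations


def l0_reconstruct(W, y, s_max, tol=1e-9):
    """Search supports of increasing size for a vector v with W v = y.
    Returns the first (hence sparsest) solution found, or None."""
    n, d = W.shape
    if np.linalg.norm(y) <= tol:
        return np.zeros(d)
    for size in range(1, s_max + 1):
        for support in combinations(range(d), size):
            cols = list(support)
            coef, *_ = np.linalg.lstsq(W[:, cols], y, rcond=None)
            if np.linalg.norm(W[:, cols] @ coef - y) <= tol:
                v = np.zeros(d)
                v[cols] = coef
                return v
    return None


rng = np.random.default_rng(0)
n, d, s = 8, 12, 2
W = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, d))
x = np.zeros(d)
x[[3, 9]] = [1.5, -2.0]                 # a 2-sparse vector
y = W @ x                               # its compression
x_tilde = l0_reconstruct(W, y, s_max=s)
print(x_tilde)
```

The number of candidate supports grows like $\binom{d}{s}$, which is why this scheme is called nonefficient.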


The reconstruction scheme given in Theorem 23.6 seems to be nonefficient 
because we need to minimize a combinatorial objective (the sparsity of v). Quite 
surprisingly, it turns out that we can replace the combinatorial objective, ||v||o, with 
a convex objective, ||v||1, which leads to a linear programming problem that can be 
solved efficiently. This is stated formally in the following theorem. 


Theorem 23.7. Assume that the conditions of Theorem 23.6 hold and that
$\epsilon < \frac{1}{1+\sqrt{2}}$. Then,

$$x = \operatorname*{argmin}_{v :\, Wv = y} \|v\|_0 = \operatorname*{argmin}_{v :\, Wv = y} \|v\|_1.$$

In fact, we will prove a stronger result, which holds even if $x$ is not a sparse
vector.


Theorem 23.8. Let $\epsilon < \frac{1}{1+\sqrt{2}}$ and let $W$ be a
$(\epsilon, 2s)$-RIP matrix. Let $x$ be an arbitrary vector and denote

$$x_s \in \operatorname*{argmin}_{v :\, \|v\|_0 \le s} \|x - v\|_1.$$

That is, $x_s$ is the vector which equals $x$ on the $s$ largest elements (in
absolute value) of $x$ and equals 0 elsewhere. Let $y = Wx$ be the compression of
$x$ and let

$$x^\star \in \operatorname*{argmin}_{v :\, Wv = y} \|v\|_1$$

be the reconstructed vector. Then,

$$\|x^\star - x\|_2 \le 2\,\frac{1+\rho}{1-\rho}\, s^{-1/2}\, \|x - x_s\|_1,$$

where $\rho = \sqrt{2}\epsilon/(1-\epsilon)$.


Note that in the special case that $x = x_s$ we get an exact recovery,
$x^\star = x$, so Theorem 23.7 is a special case of Theorem 23.8. The proof of
Theorem 23.8 is given in Section 23.3.1.
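The $\ell_1$ reconstruction can be written as a linear program by splitting $v = p - q$ with $p, q \ge 0$ and minimizing $\sum_i (p_i + q_i)$ subject to $W(p - q) = y$. The sketch below assumes SciPy's `scipy.optimize.linprog` with the HiGHS solver is available; the function name `l1_reconstruct` and the problem sizes are illustrative, and the code does not check the RIP condition of the theorems.

```python
import numpy as np
from scipy.optimize import linprog


def l1_reconstruct(W, y):
    """Solve min ||v||_1 subject to W v = y as a linear program."""
    n, d = W.shape
    c = np.ones(2 * d)                       # objective: sum(p) + sum(q)
    A_eq = np.hstack([W, -W])                # W p - W q = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:d] - res.x[d:]             # v = p - q


rng = np.random.default_rng(0)
n, d, s = 30, 60, 3
W = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, d))
x = np.zeros(d)
support = rng.choice(d, size=s, replace=False)
x[support] = rng.standard_normal(s)          # an s-sparse signal

x_star = l1_reconstruct(W, W @ x)
recovery_error = np.linalg.norm(x_star - x)
print(recovery_error)
```

With these sizes the sparse signal is typically recovered up to solver tolerance, although only $n = 30$ of the $d = 60$ coordinates' worth of measurements were taken.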

Finally, the third result tells us that random matrices with
$n \ge \Omega(s \log(d))$ are likely to be RIP. In fact, the theorem shows that
multiplying a random matrix by an orthonormal matrix also provides an RIP matrix.
This is important for compressing signals of the form $x = U\alpha$ where $x$ is
not sparse but $\alpha$ is sparse. In that case, if $W$ is a random matrix and we
compress using $y = Wx$ then this is the same as compressing $\alpha$ by
$y = (WU)\alpha$ and since $WU$ is also RIP we can reconstruct $\alpha$ (and thus
also $x$) from $y$.


Theorem 23.9. Let $U$ be an arbitrary fixed $d \times d$ orthonormal matrix, let
$\epsilon, \delta$ be scalars in $(0,1)$, let $s$ be an integer in $[d]$, and let
$n$ be an integer that satisfies

$$n \ge 100\, \frac{s \log(40d/(\delta\epsilon))}{\epsilon^2}.$$

Let $W \in \mathbb{R}^{n,d}$ be a matrix s.t. each element of $W$ is distributed
normally with zero mean and variance of $1/n$. Then, with probability of at least
$1 - \delta$ over the choice of $W$, the matrix $WU$ is $(\epsilon, s)$-RIP.


23.3.1 Proofs* 


Proof of Theorem 23.8
We follow a proof due to Candès (2008).

Let $h = x^\star - x$. Given a vector $v$ and a set of indices $I$ we denote by
$v_I$ the vector whose $i$th element is $v_i$ if $i \in I$ and 0 otherwise.

The first trick we use is to partition the set of indices $[d] = \{1, \ldots, d\}$
into disjoint sets of size $s$. That is, we will write
$[d] = T_0 \cup T_1 \cup T_2 \cup \cdots \cup T_{d/s-1}$ where for all $i$,
$|T_i| = s$, and we assume for simplicity that $d/s$ is an integer. We define the
partition as follows. In $T_0$ we put the $s$ indices corresponding to the $s$
largest elements in absolute values of $x$ (ties are broken arbitrarily). Let
$T_0^c = [d] \setminus T_0$. Next, $T_1$ will be the $s$ indices corresponding to
the $s$ largest elements in absolute value of $h_{T_0^c}$. Let
$T_{0,1} = T_0 \cup T_1$ and $T_{0,1}^c = [d] \setminus T_{0,1}$. Next, $T_2$
will correspond to the $s$ largest elements in absolute value of $h_{T_{0,1}^c}$.
And, we will construct $T_3, T_4, \ldots$ in the same way.

To prove the theorem we first need the following lemma, which shows that RIP 
also implies approximate orthogonality. 


Lemma 23.10 Let $W$ be an $(\epsilon, 2s)$-RIP matrix. Then, for any two disjoint
sets $I, J$, both of size at most $s$, and for any vector $u$ we have that
$\langle Wu_I, Wu_J \rangle \le \epsilon \|u_I\|_2 \|u_J\|_2$.


Proof. W.l.o.g. assume $\|u_I\|_2 = \|u_J\|_2 = 1$.

$$\langle Wu_I, Wu_J \rangle = \frac{\|Wu_I + Wu_J\|_2^2 - \|Wu_I - Wu_J\|_2^2}{4}.$$

But, since $|J \cup I| \le 2s$ we get from the RIP condition that
$\|Wu_I + Wu_J\|_2^2 \le (1+\epsilon)(\|u_I\|_2^2 + \|u_J\|_2^2) = 2(1+\epsilon)$
and that
$-\|Wu_I - Wu_J\|_2^2 \le -(1-\epsilon)(\|u_I\|_2^2 + \|u_J\|_2^2) = -2(1-\epsilon)$,
which concludes our proof. □


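We are now ready to prove the theorem. Clearly,

$$\|h\|_2 = \|h_{T_{0,1}} + h_{T_{0,1}^c}\|_2 \le \|h_{T_{0,1}}\|_2 + \|h_{T_{0,1}^c}\|_2. \tag{23.5}$$

To prove the theorem we will show the following two claims:


Claim 1: $\|h_{T_{0,1}^c}\|_2 \le \|h_{T_0}\|_2 + 2s^{-1/2}\|x - x_s\|_1$.


Claim 2: $\|h_{T_{0,1}}\|_2 \le \frac{2\rho}{1-\rho}\, s^{-1/2}\|x - x_s\|_1$.


Combining these two claims with Equation (23.5), and using
$\|h_{T_0}\|_2 \le \|h_{T_{0,1}}\|_2$, we get that

$$\begin{aligned}
\|h\|_2 &\le \|h_{T_{0,1}}\|_2 + \|h_{T_{0,1}^c}\|_2 \le 2\|h_{T_{0,1}}\|_2 + 2s^{-1/2}\|x - x_s\|_1 \\
&\le 2\left(\frac{2\rho}{1-\rho} + 1\right) s^{-1/2}\|x - x_s\|_1 \\
&= 2\,\frac{1+\rho}{1-\rho}\, s^{-1/2}\|x - x_s\|_1,
\end{aligned}$$

and this will conclude our proof.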


Proving Claim 1:

To prove this claim we do not use the RIP condition at all but only use the fact
that $x^\star$ minimizes the $\ell_1$ norm. Take $j > 1$. For each $i \in T_j$
and $i' \in T_{j-1}$ we have that $|h_i| \le |h_{i'}|$. Therefore,
$\|h_{T_j}\|_\infty \le \|h_{T_{j-1}}\|_1 / s$. Thus,

$$\|h_{T_j}\|_2 \le s^{1/2}\|h_{T_j}\|_\infty \le s^{-1/2}\|h_{T_{j-1}}\|_1.$$

Summing this over $j = 2, 3, \ldots$ and using the triangle inequality we obtain
that

$$\|h_{T_{0,1}^c}\|_2 \le \sum_{j \ge 2} \|h_{T_j}\|_2 \le s^{-1/2}\|h_{T_0^c}\|_1. \tag{23.6}$$

Next, we show that $\|h_{T_0^c}\|_1$ cannot be large. Indeed, from the definition
of $x^\star$ we have that $\|x\|_1 \ge \|x^\star\|_1 = \|x + h\|_1$. Thus, using
the triangle inequality we obtain that

$$\|x\|_1 \ge \|x + h\|_1 = \sum_{i \in T_0} |x_i + h_i| + \sum_{i \in T_0^c} |x_i + h_i| \ge \|x_{T_0}\|_1 - \|h_{T_0}\|_1 + \|h_{T_0^c}\|_1 - \|x_{T_0^c}\|_1 \tag{23.7}$$

and since $\|x_{T_0^c}\|_1 = \|x - x_s\|_1 = \|x\|_1 - \|x_{T_0}\|_1$ we get that

$$\|h_{T_0^c}\|_1 \le \|h_{T_0}\|_1 + 2\|x_{T_0^c}\|_1. \tag{23.8}$$

Combining this with Equation (23.6) we get that

$$\|h_{T_{0,1}^c}\|_2 \le s^{-1/2}\left(\|h_{T_0}\|_1 + 2\|x_{T_0^c}\|_1\right) \le \|h_{T_0}\|_2 + 2s^{-1/2}\|x_{T_0^c}\|_1,$$

which concludes the proof of claim 1.
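Proving Claim 2:
For the second claim we use the RIP condition to get that

$$(1-\epsilon)\|h_{T_{0,1}}\|_2^2 \le \|Wh_{T_{0,1}}\|_2^2. \tag{23.9}$$

Since $Wh_{T_{0,1}} = Wh - \sum_{j \ge 2} Wh_{T_j} = -\sum_{j \ge 2} Wh_{T_j}$
(recall that $Wh = Wx^\star - Wx = 0$) we have that

$$\|Wh_{T_{0,1}}\|_2^2 = -\sum_{j \ge 2} \langle Wh_{T_{0,1}}, Wh_{T_j} \rangle = -\sum_{j \ge 2} \langle Wh_{T_0} + Wh_{T_1}, Wh_{T_j} \rangle.$$

From the RIP condition on inner products we obtain that for all $i \in \{0,1\}$
and $j \ge 2$ we have

$$|\langle Wh_{T_i}, Wh_{T_j} \rangle| \le \epsilon \|h_{T_i}\|_2 \|h_{T_j}\|_2.$$

Since $\|h_{T_0}\|_2 + \|h_{T_1}\|_2 \le \sqrt{2}\|h_{T_{0,1}}\|_2$ we therefore
get that

$$\|Wh_{T_{0,1}}\|_2^2 \le \sqrt{2}\epsilon \|h_{T_{0,1}}\|_2 \sum_{j \ge 2} \|h_{T_j}\|_2.$$

Combining this with Equation (23.6) and Equation (23.9) we obtain

$$(1-\epsilon)\|h_{T_{0,1}}\|_2^2 \le \sqrt{2}\epsilon \|h_{T_{0,1}}\|_2\, s^{-1/2}\|h_{T_0^c}\|_1.$$

Rearranging the inequality gives

$$\|h_{T_{0,1}}\|_2 \le \frac{\sqrt{2}\epsilon}{1-\epsilon}\, s^{-1/2}\|h_{T_0^c}\|_1.$$

Finally, using Equation (23.8) we get that

$$\|h_{T_{0,1}}\|_2 \le \rho s^{-1/2}\left(\|h_{T_0}\|_1 + 2\|x_{T_0^c}\|_1\right) \le \rho\|h_{T_0}\|_2 + 2\rho s^{-1/2}\|x_{T_0^c}\|_1,$$

but since $\|h_{T_0}\|_2 \le \|h_{T_{0,1}}\|_2$ this implies

$$\|h_{T_{0,1}}\|_2 \le \frac{2\rho}{1-\rho}\, s^{-1/2}\|x_{T_0^c}\|_1,$$

which concludes the proof of the second claim.

Proof of Theorem 23.9
To prove the theorem we follow an approach due to Baraniuk, Davenport, DeVore &
Wakin (2008). The idea is to combine the Johnson-Lindenstrauss (JL) lemma with a
simple covering argument.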

We start with a covering property of the unit ball. 

Lemma 23.11. Let $\epsilon \in (0,1)$. There exists a finite set $Q \subset \mathbb{R}^d$ of size $|Q| \le \left(\frac{3}{\epsilon}\right)^d$ such 
that 

$$\sup_{x:\|x\|\le 1}\ \min_{v\in Q}\ \|x - v\| \le \epsilon.$$


Proof. Let $k$ be an integer and let 

$$Q' = \{x \in \mathbb{R}^d : \forall j \in [d],\ \exists i \in \{-k,-k+1,\ldots,k\}\ \text{s.t.}\ x_j = i/k\}.$$

Clearly, $|Q'| = (2k+1)^d$. We shall set $Q = Q' \cap B_2(1)$, where $B_2(1)$ is the unit $\ell_2$ ball 
of $\mathbb{R}^d$. Since the points in $Q'$ are distributed evenly on the unit $\ell_\infty$ ball, the size of $Q$ 
is the size of $Q'$ times the ratio between the volumes of the unit $\ell_2$ and $\ell_\infty$ balls. The 
volume of the $\ell_\infty$ ball is $2^d$ and the volume of $B_2(1)$ is 

$$\frac{\pi^{d/2}}{\Gamma(1+d/2)}.$$

For simplicity, assume that $d$ is even and therefore 

$$\Gamma(1+d/2) = (d/2)! \ge \left(\frac{d/2}{e}\right)^{d/2},$$

where in the last inequality we used Stirling's approximation. Overall we obtained 
that 

$$|Q| \le (2k+1)^d\,(\pi/e)^{d/2}\,(d/2)^{-d/2}\,2^{-d}. \tag{23.10}$$

Now let us specify $k$. For each $x \in B_2(1)$ let $v \in Q$ be the vector whose $i$th element 
is $\mathrm{sign}(x_i)\,\lfloor |x_i|\,k \rfloor / k$. Then, for each element we have that $|x_i - v_i| \le 1/k$ and thus 

$$\|x - v\| \le \frac{\sqrt{d}}{k}.$$

To ensure that the right-hand side will be at most $\epsilon$ we shall set $k = \lceil \sqrt{d}/\epsilon \rceil$. Plugging 
this value into Equation (23.10) we conclude that 

$$|Q| \le \left(\frac{3\sqrt{d}}{2\epsilon}\right)^d (\pi/e)^{d/2}\,(d/2)^{-d/2} = \left(\frac{3}{\epsilon}\sqrt{\frac{\pi}{2e}}\right)^d \le \left(\frac{3}{\epsilon}\right)^d.$$

□ 
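
The rounding map used in the proof can be tried on a concrete point. The following is a minimal sketch, assuming illustrative values for `d` and `eps` (these are not from the text): it rounds each coordinate toward zero onto the grid of multiples of 1/k and checks the distance bound.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eps = 6, 0.25                      # illustrative values (assumption)
k = int(np.ceil(np.sqrt(d) / eps))    # k = ceil(sqrt(d) / eps), as in the proof

# A random point of the unit l2 ball: a random direction scaled by a radius in [0, 1).
x = rng.standard_normal(d)
x *= rng.uniform() / np.linalg.norm(x)

# Round each coordinate toward zero onto the grid {-k, ..., k} / k.
v = np.sign(x) * np.floor(np.abs(x) * k) / k

dist = np.linalg.norm(x - v)
print(dist <= np.sqrt(d) / k, np.sqrt(d) / k <= eps, np.linalg.norm(v) <= 1.0)
```

Rounding toward zero never increases a coordinate in absolute value, which is why `v` stays inside the unit ball.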


Let $x$ be a vector that can be written as $x = U\alpha$ with $U$ being some orthonormal 
matrix and $\|\alpha\|_0 \le s$. Combining the earlier covering property and the JL lemma 
(Lemma 23.4) enables us to show that a random $W$ will not distort any such $x$. 


Lemma 23.12. Let $U$ be an orthonormal $d \times d$ matrix and let $I \subset [d]$ be a set of 
indices of size $|I| = s$. Let $S$ be the span of $\{U_i : i \in I\}$, where $U_i$ is the $i$th column of 
$U$. Let $\delta \in (0,1)$, $\epsilon \in (0,1)$, and $n \in \mathbb{N}$ such that 

$$n \ge 24\,\frac{\log(2/\delta) + s\log(12/\epsilon)}{\epsilon^2}.$$

Then, with probability of at least $1-\delta$ over a choice of a random matrix $W \in \mathbb{R}^{n,d}$ 
such that each element of $W$ is independently distributed according to $N(0,1/n)$, we 
have 

$$\sup_{x\in S}\left|\frac{\|Wx\|}{\|x\|} - 1\right| < \epsilon.$$

Proof. It suffices to prove the lemma for all $x \in S$ with $\|x\| = 1$. We can write $x = U_I\alpha$ 
where $\alpha \in \mathbb{R}^s$, $\|\alpha\|_2 = 1$, and $U_I$ is the matrix whose columns are $\{U_i : i \in I\}$. Using 
Lemma 23.11 we know that there exists a set $Q$ of size $|Q| \le (12/\epsilon)^s$ such that 

$$\sup_{\alpha:\|\alpha\|=1}\ \min_{v\in Q}\ \|\alpha - v\| \le \epsilon/4.$$

But since $U$ is orthogonal we also have that 

$$\sup_{\alpha:\|\alpha\|=1}\ \min_{v\in Q}\ \|U_I\alpha - U_I v\| \le \epsilon/4.$$

Applying Lemma 23.4 on the set $\{U_I v : v \in Q\}$ we obtain that for $n$ satisfying the 
condition given in the lemma, the following holds with probability of at least $1-\delta$: 

$$\sup_{v\in Q}\left|\frac{\|WU_Iv\|^2}{\|U_Iv\|^2} - 1\right| \le \epsilon/2.$$

This also implies that 

$$\sup_{v\in Q}\left|\frac{\|WU_Iv\|}{\|U_Iv\|} - 1\right| \le \epsilon/2.$$
Let $a$ be the smallest number such that 

$$\forall x \in S,\quad \frac{\|Wx\|}{\|x\|} \le 1 + a.$$

Clearly $a < \infty$. Our goal is to show that $a \le \epsilon$. This follows from the fact that for any 
$x \in S$ of unit norm there exists $v \in Q$ such that $\|x - U_Iv\| \le \epsilon/4$ and therefore 

$$\|Wx\| \le \|WU_Iv\| + \|W(x - U_Iv)\| \le 1 + \epsilon/2 + (1+a)\,\epsilon/4.$$

Thus, 

$$\forall x \in S,\quad \frac{\|Wx\|}{\|x\|} \le 1 + \left(\epsilon/2 + (1+a)\,\epsilon/4\right).$$

But the definition of $a$ implies that 

$$a \le \epsilon/2 + (1+a)\,\epsilon/4 \ \Rightarrow\ a \le \frac{\epsilon/2 + \epsilon/4}{1 - \epsilon/4} \le \epsilon.$$

This proves that for all $x \in S$ we have $\frac{\|Wx\|}{\|x\|} - 1 \le \epsilon$. The other side follows from this 
as well since 

$$\|Wx\| \ge \|WU_Iv\| - \|W(x - U_Iv)\| \ge 1 - \epsilon/2 - (1+\epsilon)\,\epsilon/4 \ge 1 - \epsilon.$$

□ 

The preceding lemma tells us that for $x \in S$ of unit norm we have 

$$(1-\epsilon) \le \|Wx\| \le (1+\epsilon),$$

which implies that 

$$(1-2\epsilon) \le \|Wx\|^2 \le (1+3\epsilon).$$

The proof of Theorem 23.9 follows from this by a union bound over all choices of $I$. 
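
As an empirical illustration only (not a substitute for the proof), one can draw a Gaussian matrix with N(0, 1/n) entries and look at how much it distorts the norm of random s-sparse vectors. The sizes `d`, `n`, `s` and the number of trials below are arbitrary assumptions, and U is taken to be the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, s = 1000, 400, 5   # illustrative sizes, not the constants of the lemma

# W has i.i.d. N(0, 1/n) entries, as in the statement of Lemma 23.12.
W = rng.standard_normal((n, d)) / np.sqrt(n)

ratios = []
for _ in range(200):
    x = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)
    x[support] = rng.standard_normal(s)      # an s-sparse vector (U = identity)
    ratios.append(np.linalg.norm(W @ x) / np.linalg.norm(x))

worst = max(abs(r - 1.0) for r in ratios)
print(f"largest | ||Wx||/||x|| - 1 | over the sample: {worst:.3f}")
```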


23.4 PCA OR COMPRESSED SENSING? 


Suppose we would like to apply a dimensionality reduction technique to a given set 
of examples. Which method should we use, PCA or compressed sensing? In this 
section we tackle this question, by underscoring the underlying assumptions behind 
the two methods. 

It is helpful first to understand when each of the methods can guarantee perfect 
recovery. PCA guarantees perfect recovery whenever the set of examples is 
contained in an $n$ dimensional subspace of $\mathbb{R}^d$. Compressed sensing guarantees 
perfect recovery whenever the set of examples is sparse (in some basis). On the 
basis of these observations, we can describe cases in which PCA will be better than 
compressed sensing and vice versa. 

As a first example, suppose that the examples are the vectors of the standard 
basis of $\mathbb{R}^d$, namely, $e_1,\ldots,e_d$, where each $e_i$ is the all zeros vector except 1 in the $i$th 
coordinate. In this case, the examples are 1-sparse. Hence, compressed sensing will 
yield a perfect recovery whenever $n \ge \Omega(\log(d))$. On the other hand, PCA will lead 
to poor performance, since the data is far from being in an $n$ dimensional subspace, 
as long as $n < d$. Indeed, it is easy to verify that in such a case, the averaged recovery 
error of PCA (i.e., the objective of Equation (23.1) divided by $m$) will be $(d-n)/d$, 
which is larger than $1/2$ whenever $n < d/2$. 
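
The averaged recovery error in the standard-basis example can be confirmed with a few lines of NumPy. This is a minimal sketch with assumed values of `d` and `n`; it uses PCA without centering, that is, the leading eigenvectors of the matrix obtained by summing the outer products of the examples.

```python
import numpy as np

d, n = 8, 3                        # illustrative dimensions (assumption)
X = np.eye(d)                      # rows are the examples e_1, ..., e_d, so m = d

# PCA without centering: leading eigenvectors of A = sum_i x_i x_i^T.
A = X.T @ X
eigvals, eigvecs = np.linalg.eigh(A)     # eigenvalues in ascending order
U = eigvecs[:, -n:]                       # d x n matrix of the n leading eigenvectors

recovered = X @ U @ U.T                   # compress with U^T, recover with U
avg_error = np.mean(np.sum((X - recovered) ** 2, axis=1))
print(avg_error, (d - n) / d)
```

Because every direction is equally important here, any n orthonormal directions give the same error, which is why the result does not depend on which eigenvectors the solver returns.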

We next show a case where PCA is better than compressed sensing. Consider 
$m$ examples that are exactly on an $n$ dimensional subspace. Clearly, in such a case, 
PCA will lead to perfect recovery. As to compressed sensing, note that the examples 
are $n$-sparse in any orthonormal basis whose first $n$ vectors span the subspace. 
Therefore, compressed sensing would also work if we reduce the dimension 
to $\Omega(n\log(d))$. However, with exactly $n$ dimensions, compressed sensing might 
fail. PCA is also more resilient to certain types of noise. See (Chang, Weiss 
& Freeman 2009) for a discussion. 


23.5 SUMMARY 


We introduced two methods for dimensionality reduction using linear transforma- 
tions: PCA and random projections. We have shown that PCA is optimal in the 
sense of averaged squared reconstruction error, if we restrict the reconstruction pro- 
cedure to be linear as well. However, if we allow nonlinear reconstruction, PCA is 
not necessarily the optimal procedure. In particular, for sparse data, random projec- 
tions can significantly outperform PCA. This fact is at the heart of the compressed 
sensing method. 


23.6 BIBLIOGRAPHIC REMARKS 


PCA is equivalent to best subspace approximation using singular value decomposition 
(SVD). The SVD method is described in Appendix C. SVD dates back to 
Eugenio Beltrami (1873) and Camille Jordan (1874). It has been rediscovered many 
times. In the statistical literature, it was introduced by Pearson (1901). Besides PCA 
and SVD, there are additional names that refer to the same idea and are being 
used in different scientific communities. A few examples are the Eckart-Young theorem 
(after Carl Eckart and Gale Young who analyzed the method in 1936), the 
Schmidt-Mirsky theorem, factor analysis, and the Hotelling transform. 

Compressed sensing was introduced in Donoho (2006) and in Candes & Tao 
(2005). See also Candes (2006). 


23.7 EXERCISES 


23.1 In this exercise we show that in the general case, exact recovery of a linear 
compression scheme is impossible. 
1. Let $A \in \mathbb{R}^{n,d}$ be an arbitrary compression matrix where $n \le d-1$. Show that there 
exist $u, v \in \mathbb{R}^d$, $u \ne v$, such that $Au = Av$. 
2. Conclude that exact recovery of a linear compression scheme is impossible. 
23.2 Let $\alpha \in \mathbb{R}^d$ such that $\alpha_1 \ge \alpha_2 \ge \cdots \ge \alpha_d \ge 0$. Show that 

$$\max_{\beta\in[0,1]^d:\ \|\beta\|_1\le n}\ \sum_{j=1}^d \alpha_j\beta_j = \sum_{j=1}^n \alpha_j.$$

Hint: Take every vector $\beta \in [0,1]^d$ such that $\|\beta\|_1 \le n$. Let $i$ be the minimal index 
for which $\beta_i < 1$. If $i = n+1$ we are done. Otherwise, show that we can increase $\beta_i$, 
while possibly decreasing $\beta_j$ for some $j > i$, and obtain a better solution. This will 
imply that the optimal solution is to set $\beta_i = 1$ for $i \le n$ and $\beta_i = 0$ for $i > n$. 

23.3 Kernel PCA: In this exercise we show how PCA can be used for constructing 
nonlinear dimensionality reduction on the basis of the kernel trick (see 
Chapter 16). 

Let $\mathcal{X}$ be some instance space and let $S = \{x_1,\ldots,x_m\}$ be a set of points in $\mathcal{X}$. 
Consider a feature mapping $\psi : \mathcal{X} \to V$, where $V$ is some Hilbert space (possibly 
of infinite dimension). Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a kernel function, that is, $K(x,x') = 
\langle \psi(x), \psi(x') \rangle$. Kernel PCA is the process of mapping the elements in $S$ into $V$ 
using $\psi$, and then applying PCA over $\{\psi(x_1),\ldots,\psi(x_m)\}$ into $\mathbb{R}^n$. The output of 
this process is the set of reduced elements. 

Show how this process can be done in polynomial time in terms of $m$ and $n$, 
assuming that each evaluation of $K(\cdot,\cdot)$ can be calculated in a constant time. In 
particular, if your implementation requires multiplication of two matrices $A$ and 
$B$, verify that their product can be computed. Similarly, if an eigenvalue decomposition 
of some matrix $C$ is required, verify that this decomposition can be 
computed. 

23.4 An Interpretation of PCA as Variance Maximization: 
Let $x_1,\ldots,x_m$ be $m$ vectors in $\mathbb{R}^d$, and let $x$ be a random vector distributed according 
to the uniform distribution over $x_1,\ldots,x_m$. Assume that $\mathbb{E}[x] = 0$. 

1. Consider the problem of finding a unit vector, $w \in \mathbb{R}^d$, such that the random 
variable $\langle w, x\rangle$ has maximal variance. That is, we would like to solve the 
problem 

$$\operatorname*{argmax}_{w:\|w\|=1} \mathrm{Var}[\langle w, x\rangle] = \operatorname*{argmax}_{w:\|w\|=1} \frac{1}{m}\sum_{i=1}^m (\langle w, x_i\rangle)^2.$$

Show that the solution of the problem is to set $w$ to be the first principal vector 
of $x_1,\ldots,x_m$. 

2. Let $w_1$ be the first principal component as in the previous question. Now, suppose 
we would like to find a second unit vector, $w_2 \in \mathbb{R}^d$, that maximizes the 
variance of $\langle w_2, x\rangle$, but is also uncorrelated to $\langle w_1, x\rangle$. That is, we would like to 
solve 

$$\operatorname*{argmax}_{w:\|w\|=1,\ \mathbb{E}[(\langle w_1,x\rangle)(\langle w,x\rangle)]=0} \mathrm{Var}[\langle w, x\rangle].$$

Show that the solution to this problem is to set $w$ to be the second principal 
component of $x_1,\ldots,x_m$. 
Hint: Note that 

$$\mathbb{E}[(\langle w_1,x\rangle)(\langle w,x\rangle)] = w_1^\top\,\mathbb{E}[xx^\top]\,w = \frac{1}{m}\,w_1^\top A w,$$

where $A = \sum_i x_i x_i^\top$. Since $w_1$ is an eigenvector of $A$ we have that the constraint 
$\mathbb{E}[(\langle w_1,x\rangle)(\langle w,x\rangle)] = 0$ is equivalent to the constraint 

$$\langle w_1, w\rangle = 0.$$


23.5 The Relation between SVD and PCA: Use the SVD theorem (Corollary C.6) for 
providing an alternative proof of Theorem 23.2. 

23.6 Random Projections Preserve Inner Products: The Johnson-Lindenstrauss lemma 
tells us that a random projection preserves distances between a finite set of vectors. 
In this exercise you need to prove that if the vectors are within the unit ball, 
then not only are the distances between any two vectors preserved, but the inner 
product is also preserved. 

Let $Q$ be a finite set of vectors in $\mathbb{R}^d$ and assume that for every $x \in Q$ we have 
$\|x\| \le 1$. 

1. Let $\delta \in (0,1)$ and $n$ be an integer such that 

$$\epsilon = \sqrt{\frac{6\log(|Q|^2/\delta)}{n}} \le 3.$$

Prove that with probability of at least $1-\delta$ over a choice of a random matrix 
$W \in \mathbb{R}^{n,d}$, where each element of $W$ is independently distributed according to 
$N(0,1/n)$, we have 

$$|\langle Wu, Wv\rangle - \langle u, v\rangle| \le \epsilon$$

for every $u, v \in Q$. 
Hint: Use JL to bound both $\frac{\|W(u+v)\|}{\|u+v\|}$ and $\frac{\|W(u-v)\|}{\|u-v\|}$. 


2. (*) Let $x_1,\ldots,x_m$ be a set of vectors in $\mathbb{R}^d$ of norm at most 1, and assume that 
these vectors are linearly separable with margin of $\gamma$. Assume that $d \gg 1/\gamma^2$. 
Show that there exists a constant $c > 0$ such that if we randomly project these 
vectors into $\mathbb{R}^n$, for $n = c/\gamma^2$, then with probability of at least 99% it holds that 
the projected vectors are linearly separable with margin $\gamma/2$. 


We started this book with a distribution-free learning framework; namely, we did not 
impose any assumptions on the underlying distribution over the data. Furthermore, 
we followed a discriminative approach in which our goal is not to learn the underlying 
distribution but rather to learn an accurate predictor. In this chapter we describe 
a generative approach, in which it is assumed that the underlying distribution over 
the data has a specific parametric form and our goal is to estimate the parameters of 
the model. This task is called parametric density estimation. 

The discriminative approach has the advantage of directly optimizing the quan- 
tity of interest (the prediction accuracy) instead of learning the underlying distri- 
bution. This was phrased as follows by Vladimir Vapnik in his principle for solving 
problems using a restricted amount of information: 


When solving a given problem, try to avoid a more general problem as an intermediate 
step. 


Of course, if we succeed in learning the underlying distribution accurately, we 
are considered to be “experts” in the sense that we can predict by using the Bayes 
optimal classifier. The problem is that it is usually more difficult to learn the underlying 
distribution than to learn an accurate predictor. However, in some situations, it is 
reasonable to adopt the generative learning approach. For example, sometimes it 
is easier (computationally) to estimate the parameters of the model than to learn a 
discriminative predictor. Additionally, in some cases we do not have a specific task at 
hand but rather would like to model the data either for making predictions at a later 
time without having to retrain a predictor or for the sake of interpretability of the data. 

We start with a popular statistical method for estimating the parameters of the 
data, which is called the maximum likelihood principle. Next, we describe two gen- 
erative assumptions which greatly simplify the learning process. We also describe 
the EM algorithm for calculating the maximum likelihood in the presence of latent 
variables. We conclude with a brief description of Bayesian reasoning. 


24.1 MAXIMUM LIKELIHOOD ESTIMATOR 


Let us start with a simple example. A drug company developed a new drug to treat 
some deadly disease. We would like to estimate the probability of survival when 
using the drug. To do so, the drug company sampled a training set of $m$ people and 
gave them the drug. Let $S = (x_1,\ldots,x_m)$ denote the training set, where for each $i$, 
$x_i = 1$ if the $i$th person survived and $x_i = 0$ otherwise. We can model the underlying 
distribution using a single parameter, $\theta \in [0,1]$, indicating the probability of survival. 

We now would like to estimate the parameter $\theta$ on the basis of the training set $S$. 
A natural idea is to use the average number of 1's in $S$ as an estimator. That is, 

$$\hat\theta = \frac{1}{m}\sum_{i=1}^m x_i. \tag{24.1}$$

Clearly, $\mathbb{E}_S[\hat\theta] = \theta$. That is, $\hat\theta$ is an unbiased estimator of $\theta$. Furthermore, since $\hat\theta$ is 
the average of $m$ i.i.d. binary random variables we can use Hoeffding's inequality to 
get that with probability of at least $1-\delta$ over the choice of $S$ we have that 

$$|\hat\theta - \theta| \le \sqrt{\frac{\log(2/\delta)}{2m}}. \tag{24.2}$$

Another interpretation of $\hat\theta$ is as the Maximum Likelihood Estimator, as we 
formally explain now. We first write the probability of generating the sample $S$: 

$$P[S = (x_1,\ldots,x_m)] = \prod_{i=1}^m \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum_i x_i}\,(1-\theta)^{\sum_i (1-x_i)}.$$

We define the log likelihood of $S$, given the parameter $\theta$, as the log of the preceding 
expression: 

$$L(S;\theta) = \log\left(P[S = (x_1,\ldots,x_m)]\right) = \log(\theta)\sum_i x_i + \log(1-\theta)\sum_i (1-x_i).$$

The maximum likelihood estimator is the parameter that maximizes the likelihood 

$$\hat\theta \in \operatorname*{argmax}_{\theta} L(S;\theta). \tag{24.3}$$

Next, we show that in our case, Equation (24.1) is a maximum likelihood estimator. 
To see this, we take the derivative of $L(S;\theta)$ with respect to $\theta$ and equate it to zero: 

$$\frac{\sum_i x_i}{\theta} - \frac{\sum_i (1-x_i)}{1-\theta} = 0.$$

Solving the equation for $\theta$ we obtain the estimator given in Equation (24.1). 
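
The claim that the sample average maximizes the log likelihood can be checked by brute force. The sketch below assumes a hypothetical survival probability `theta_true` and sample size, and compares the closed-form estimator with a search over a fine grid of candidate parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.3                          # hypothetical survival probability
x = rng.binomial(1, theta_true, size=200)

def log_likelihood(theta, x):
    """L(S; theta) = log(theta) * sum(x) + log(1 - theta) * sum(1 - x)."""
    return np.log(theta) * x.sum() + np.log(1 - theta) * (1 - x).sum()

theta_hat = x.mean()                      # the sample average, Equation (24.1)

# Compare against a brute-force search over a fine grid of candidate values.
grid = np.linspace(0.001, 0.999, 999)
theta_grid = grid[np.argmax(log_likelihood(grid, x))]
print(theta_hat, theta_grid)
```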


24.1.1 Maximum Likelihood Estimation for Continuous 
Random Variables 


Let $X$ be a continuous random variable. Then, for most $x \in \mathbb{R}$ we have $P[X = x] = 0$ 
and therefore the definition of likelihood as given before is trivialized. To overcome 
this technical problem we define the likelihood as log of the density of the probability 
of $X$ at $x$. That is, given an i.i.d. training set $S = (x_1,\ldots,x_m)$ sampled according 
to a density distribution $P_\theta$ we define the likelihood of $S$ given $\theta$ as 

$$L(S;\theta) = \log\left(\prod_{i=1}^m P_\theta(x_i)\right) = \sum_{i=1}^m \log\left(P_\theta(x_i)\right).$$

As before, the maximum likelihood estimator is a maximizer of $L(S;\theta)$ with respect 
to $\theta$. 

As an example, consider a Gaussian random variable, for which the density 
function of $X$ is parameterized by $\theta = (\mu,\sigma)$ and is defined as follows: 

$$P_\theta(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

We can rewrite the likelihood as 

$$L(S;\theta) = -\frac{1}{2\sigma^2}\sum_{i=1}^m (x_i-\mu)^2 - m\log\left(\sigma\sqrt{2\pi}\right).$$

To find a parameter $\theta = (\mu,\sigma)$ that optimizes this we take the derivative of the 
likelihood w.r.t. $\mu$ and w.r.t. $\sigma$ and compare it to 0. We obtain the following two 
equations: 

$$\frac{d}{d\mu}L(S;\theta) = \frac{1}{\sigma^2}\sum_{i=1}^m (x_i-\mu) = 0$$

$$\frac{d}{d\sigma}L(S;\theta) = \frac{1}{\sigma^3}\sum_{i=1}^m (x_i-\mu)^2 - \frac{m}{\sigma} = 0$$

Solving the preceding equations we obtain the maximum likelihood estimates: 

$$\hat\mu = \frac{1}{m}\sum_{i=1}^m x_i \quad\text{and}\quad \hat\sigma = \sqrt{\frac{1}{m}\sum_{i=1}^m (x_i-\hat\mu)^2}.$$

Note that the maximum likelihood estimate is not always an unbiased estimator. 
For example, while $\hat\mu$ is unbiased, it is possible to show that the estimate $\hat\sigma$ of the 
variance is biased (Exercise 24.1). 
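
The bias of the variance estimate is easy to observe in simulation. The following sketch uses hypothetical parameters and a deliberately tiny sample size; it averages the squared maximum likelihood estimate of the standard deviation over many independent samples and compares the result with the true variance and with the known value (m - 1)/m times the true variance.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true, m = 1.0, 2.0, 5      # hypothetical parameters, tiny sample size

def gaussian_mle(x):
    """Return (mu_hat, sigma_hat) solving the two stationarity equations."""
    mu_hat = x.mean()
    sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))
    return mu_hat, sigma_hat

# Average the variance estimate sigma_hat^2 over many independent samples.
trials = 20000
samples = rng.normal(mu_true, sigma_true, size=(trials, m))
var_estimates = np.array([gaussian_mle(row)[1] ** 2 for row in samples])

mean_var = var_estimates.mean()
print(mean_var, (m - 1) / m * sigma_true ** 2, sigma_true ** 2)
```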


Simplifying Notation 

To simplify our notation, we use P[X = x] in this chapter to describe both the prob- 
ability that X = x (for discrete random variables) and the density of the distribution 
at x (for continuous variables). 


24.1.2 Maximum Likelihood and Empirical Risk Minimization 


The maximum likelihood estimator shares some similarity with the Empirical Risk 
Minimization (ERM) principle, which we studied extensively in previous chapters. 
Recall that in the ERM principle we have a hypothesis class H and we use the 
training set for choosing a hypothesis h € H that minimizes the empirical risk. We 
now show that the maximum likelihood estimator is an ERM for a particular loss 
function. 

Given a parameter $\theta$ and an observation $x$, we define the loss of $\theta$ on $x$ as 

$$\ell(\theta,x) = -\log(P_\theta[x]). \tag{24.4}$$

That is, $\ell(\theta,x)$ is the negation of the log-likelihood of the observation $x$, assuming 
the data is distributed according to $P_\theta$. This loss function is often referred to as 
the log-loss. On the basis of this definition it is immediate that the maximum likelihood 
principle is equivalent to minimizing the empirical risk with respect to the loss 
function given in Equation (24.4). That is, 

$$\operatorname*{argmin}_{\theta}\sum_{i=1}^m \left(-\log(P_\theta[x_i])\right) = \operatorname*{argmax}_{\theta}\sum_{i=1}^m \log(P_\theta[x_i]).$$

Assuming that the data is distributed according to a distribution $P$ (not necessarily 
of the parametric form we employ), the true risk of a parameter $\theta$ becomes 

$$\mathbb{E}_x[\ell(\theta,x)] = -\sum_x P[x]\log(P_\theta[x]) = \underbrace{\sum_x P[x]\log\left(\frac{P[x]}{P_\theta[x]}\right)}_{D_{RE}[P\|P_\theta]} + \underbrace{\sum_x P[x]\log\left(\frac{1}{P[x]}\right)}_{H(P)}, \tag{24.5}$$

where $D_{RE}$ is called the relative entropy, and $H$ is called the entropy function. The 
relative entropy is a divergence measure between two probabilities. For discrete 
variables, it is always nonnegative and is equal to 0 only if the two distributions are 
the same. It follows that the true risk is minimal when $P_\theta = P$. 

The expression given in Equation (24.5) underscores how our generative 
assumption affects our density estimation, even in the limit of infinite data. It shows 
that if the underlying distribution is indeed of a parametric form, then by choos- 
ing the correct parameter we can make the risk be the entropy of the distribution. 
However, if the distribution is not of the assumed parametric form, even the best 
parameter leads to an inferior model and the suboptimality is measured by the 
relative entropy divergence. 
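
The decomposition of the true risk into relative entropy plus entropy can be verified on a small discrete example. The two distributions below are hypothetical and are assumed to have full support so that every logarithm is finite.

```python
import numpy as np

# Two hypothetical distributions over three outcomes (full support assumed).
P = np.array([0.5, 0.3, 0.2])          # true distribution
P_theta = np.array([0.4, 0.4, 0.2])    # model distribution

true_risk = -np.sum(P * np.log(P_theta))             # expected log-loss
relative_entropy = np.sum(P * np.log(P / P_theta))   # divergence between P and P_theta
entropy = np.sum(P * np.log(1.0 / P))                # H(P)

print(true_risk, relative_entropy + entropy)
```

The entropy term does not depend on the model, so minimizing the true risk over the model parameters is the same as minimizing the relative entropy.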


24.1.3 Generalization Analysis 


How good is the maximum likelihood estimator when we learn from a finite 
training set? 

To answer this question we need to define how we assess the quality of an approx- 
imated solution of the density estimation problem. Unlike discriminative learning, 
where there is a clear notion of “loss,” in generative learning there are various ways 
to define the loss of a model. On the basis of the previous subsection, one natural 
candidate is the expected log-loss as given in Equation (24.5). 

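In some situations, it is easy to prove that the maximum likelihood principle 
guarantees low true risk as well. For example, consider the problem of estimating 
the mean of a Gaussian variable of unit variance. We saw previously that the 
maximum likelihood estimator is the average: $\hat\mu = \frac{1}{m}\sum_i x_i$. Let $\mu^\star$ be the optimal 
parameter. Then, 

$$\begin{aligned}
\mathbb{E}_{x\sim N(\mu^\star,1)}\left[\ell(\hat\mu,x) - \ell(\mu^\star,x)\right] &= \mathbb{E}_{x\sim N(\mu^\star,1)}\log\left(\frac{P_{\mu^\star}[x]}{P_{\hat\mu}[x]}\right) \\
&= \mathbb{E}_{x\sim N(\mu^\star,1)}\left(-\tfrac{1}{2}(x-\mu^\star)^2 + \tfrac{1}{2}(x-\hat\mu)^2\right) \\
&= \frac{\hat\mu^2}{2} - \frac{(\mu^\star)^2}{2} + (\mu^\star - \hat\mu)\,\mathbb{E}_{x\sim N(\mu^\star,1)}[x] \\
&= \frac{\hat\mu^2}{2} - \frac{(\mu^\star)^2}{2} + (\mu^\star - \hat\mu)\,\mu^\star \\
&= \tfrac{1}{2}(\hat\mu - \mu^\star)^2.
\end{aligned} \tag{24.6}$$

Next, we note that $\hat\mu$ is the average of $m$ Gaussian variables and therefore it is also 
distributed normally with mean $\mu^\star$ and variance $\sigma^\star/m$ (here each variable has unit 
variance, so $\sigma^\star = 1$). From this fact we can derive bounds of the form: with 
probability of at least $1-\delta$ we have that $|\hat\mu - \mu^\star| \le \epsilon$, where 
$\epsilon$ depends on $\sigma^\star/m$ and on $\delta$. 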

In some situations, the maximum likelihood estimator clearly overfits. For example, 
consider a Bernoulli random variable $X$ and let $P[X = 1] = \theta^\star$. As we saw 
previously, using Hoeffding's inequality we can easily derive a guarantee on $|\theta^\star - \hat\theta|$ 
that holds with high probability (see Equation (24.2)). However, if our goal is to 
obtain a small value of the expected log-loss function as defined in Equation (24.5) 
we might fail. For example, assume that $\theta^\star$ is nonzero but very small. Then, the 
probability that no element of a sample of size $m$ will be 1 is $(1-\theta^\star)^m$, which is 
greater than $e^{-2\theta^\star m}$. It follows that whenever $m \le \frac{\log(2)}{2\theta^\star}$, the probability that the 
sample is all zeros is at least 50%, and in that case, the maximum likelihood rule will 
set $\hat\theta = 0$. But the true risk of the estimate $\hat\theta = 0$ is 

$$\begin{aligned}
\mathbb{E}_{x\sim\theta^\star}[\ell(\hat\theta,x)] &= \theta^\star\,\ell(\hat\theta,1) + (1-\theta^\star)\,\ell(\hat\theta,0) \\
&= \theta^\star\log(1/\hat\theta) + (1-\theta^\star)\log(1/(1-\hat\theta)) \\
&= \theta^\star\log(1/0) = \infty.
\end{aligned}$$

This simple example shows that we should be careful in applying the maximum 
likelihood principle. 
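
To make the numbers in this example concrete, the sketch below picks a hypothetical small value `theta_star`, takes the largest sample size covered by the bound, and evaluates both the exact probability of an all-zeros sample and the exponential lower bound. The helper `true_risk` is an illustrative function, not part of the text.

```python
import math

theta_star = 0.001                                  # hypothetical, small but nonzero
m = math.floor(math.log(2) / (2 * theta_star))      # largest m covered by the bound

p_all_zeros = (1 - theta_star) ** m      # exact probability of an all-zeros sample
lower_bound = math.exp(-2 * theta_star * m)

def true_risk(theta_hat, theta_star):
    """Expected log-loss of theta_hat; infinite if theta_hat is 0 or 1
    (assuming 0 < theta_star < 1, so both outcomes can occur)."""
    if theta_hat in (0.0, 1.0):
        return math.inf
    return (theta_star * math.log(1 / theta_hat)
            + (1 - theta_star) * math.log(1 / (1 - theta_hat)))

print(m, p_all_zeros, lower_bound, true_risk(0.0, theta_star))
```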

To overcome overfitting, we can use the variety of tools we encountered previ- 
ously in the book. A simple regularization technique is outlined in Exercise 24.2. 


24.2 NAIVE BAYES 


The Naive Bayes classifier is a classical demonstration of how generative assumptions 
and parameter estimations simplify the learning process. Consider the problem 
of predicting a label $y \in \{0,1\}$ on the basis of a vector of features $x = (x_1,\ldots,x_d)$, 
where we assume that each $x_i$ is in $\{0,1\}$. Recall that the Bayes optimal classifier is 

$$h_{\text{Bayes}}(x) = \operatorname*{argmax}_{y\in\{0,1\}} P[Y = y \mid X = x].$$

To describe the probability function $P[Y = y \mid X = x]$ we need $2^d$ parameters, each 
of which corresponds to $P[Y = 1 \mid X = x]$ for a certain value of $x \in \{0,1\}^d$. This 
implies that the number of examples we need grows exponentially with the number 
of features. 

In the Naive Bayes approach we make the (rather naive) generative assumption 
that given the label, the features are independent of each other. That is, 

$$P[X = x \mid Y = y] = \prod_{i=1}^d P[X_i = x_i \mid Y = y].$$

With this assumption and using the Bayes rule, the Bayes optimal classifier can be 
further simplified: 

$$\begin{aligned}
h_{\text{Bayes}}(x) &= \operatorname*{argmax}_{y\in\{0,1\}} P[Y = y \mid X = x] \\
&= \operatorname*{argmax}_{y\in\{0,1\}} P[Y = y]\,P[X = x \mid Y = y] \\
&= \operatorname*{argmax}_{y\in\{0,1\}} P[Y = y]\prod_{i=1}^d P[X_i = x_i \mid Y = y].
\end{aligned} \tag{24.7}$$

That is, now the number of parameters we need to estimate is only $2d + 1$. Here, the 
generative assumption we made reduced significantly the number of parameters we 
need to learn. 

When we also estimate the parameters using the maximum likelihood principle, 
the resulting classifier is called the Naive Bayes classifier. 
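
A minimal sketch of the resulting classifier for binary features follows. The function names and the tiny training set are hypothetical; the parameters are estimated by maximum likelihood (empirical frequencies) and prediction evaluates Equation (24.7) for both labels.

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Maximum likelihood estimates of the 2d + 1 parameters for binary X and y."""
    prior1 = y.mean()                                   # estimate of P[Y = 1]
    # cond[label, i] estimates P[X_i = 1 | Y = label].
    cond = np.vstack([X[y == label].mean(axis=0) for label in (0, 1)])
    return prior1, cond

def predict_naive_bayes(x, prior1, cond):
    """Evaluate Equation (24.7) for y = 0 and y = 1 and return the better label."""
    scores = []
    for label, prior in ((0, 1 - prior1), (1, prior1)):
        feature_probs = np.where(x == 1, cond[label], 1 - cond[label])
        scores.append(prior * np.prod(feature_probs))
    return int(np.argmax(scores))

# Tiny hypothetical training set: the label follows the first feature.
X = np.array([[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 1, 0, 0, 0])
prior1, cond = fit_naive_bayes(X, y)
print(predict_naive_bayes(np.array([1, 1, 1]), prior1, cond))
```

With many features the product of probabilities can underflow, so a practical variant would typically sum logarithms instead; the product form is kept here to mirror Equation (24.7).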


24.3 LINEAR DISCRIMINANT ANALYSIS 


Linear discriminant analysis (LDA) is another demonstration of how generative 
assumptions simplify the learning process. As in the Naive Bayes classifier we consider 
again the problem of predicting a label $y \in \{0,1\}$ on the basis of a vector of 
features $x = (x_1,\ldots,x_d)$. But now the generative assumption is as follows. First, 
we assume that $P[Y = 1] = P[Y = 0] = 1/2$. Second, we assume that the conditional 
probability of $X$ given $Y$ is a Gaussian distribution. Finally, the covariance 
matrix of the Gaussian distribution is the same for both values of the label. Formally, 
let $\mu_0, \mu_1 \in \mathbb{R}^d$ and let $\Sigma$ be a covariance matrix. Then, the density distribution is 
given by 

$$P[X = x \mid Y = y] = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_y)^\top\Sigma^{-1}(x-\mu_y)\right).$$

As we have shown in the previous section, using the Bayes rule we can write 

$$h_{\text{Bayes}}(x) = \operatorname*{argmax}_{y\in\{0,1\}} P[Y = y]\,P[X = x \mid Y = y].$$

This means that we will predict $h_{\text{Bayes}}(x) = 1$ iff 

$$\log\left(\frac{P[Y = 1]\,P[X = x \mid Y = 1]}{P[Y = 0]\,P[X = x \mid Y = 0]}\right) > 0.$$

This ratio is often called the log-likelihood ratio. 
In our case, the log-likelihood ratio becomes 

$$\frac{1}{2}(x-\mu_0)^\top\Sigma^{-1}(x-\mu_0) - \frac{1}{2}(x-\mu_1)^\top\Sigma^{-1}(x-\mu_1).$$

We can rewrite this as $\langle w, x\rangle + b$ where 

$$w = (\mu_1-\mu_0)^\top\Sigma^{-1} \quad\text{and}\quad b = \frac{1}{2}\left(\mu_0^\top\Sigma^{-1}\mu_0 - \mu_1^\top\Sigma^{-1}\mu_1\right). \tag{24.8}$$

As a result of the preceding derivation we obtain that under the aforementioned 
generative assumptions, the Bayes optimal classifier is a linear classifier. Additionally, 
one may train the classifier by estimating the parameters $\mu_0$, $\mu_1$, and $\Sigma$ from the 
data, using, for example, the maximum likelihood estimator. With those estimators 
at hand, the values of $w$ and $b$ can be calculated as in Equation (24.8). 
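
The passage from the quadratic log-likelihood ratio to the linear form of Equation (24.8) can be checked numerically. The means, covariance matrix, and test point below are hypothetical; in practice they would be estimated from data.

```python
import numpy as np

# Hypothetical parameters; in practice they would be estimated from data.
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
Sigma_inv = np.linalg.inv(Sigma)

# Equation (24.8); Sigma_inv is symmetric, so w can be stored as a plain vector.
w = Sigma_inv @ (mu1 - mu0)
b = 0.5 * (mu0 @ Sigma_inv @ mu0 - mu1 @ Sigma_inv @ mu1)

def log_likelihood_ratio(x):
    """Quadratic form of the log-likelihood ratio before simplification."""
    d0, d1 = x - mu0, x - mu1
    return 0.5 * d0 @ Sigma_inv @ d0 - 0.5 * d1 @ Sigma_inv @ d1

x = np.array([1.5, -0.5])
print(log_likelihood_ratio(x), w @ x + b)
```

The classifier predicts label 1 exactly when this common value is positive.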


24.4 LATENT VARIABLES AND THE EM ALGORITHM 


In generative models we assume that the data is generated by sampling from a specific 
parametric distribution over our instance space $\mathcal{X}$. Sometimes, it is convenient 
to express this distribution using latent random variables. A natural example is a 
mixture of $k$ Gaussian distributions. That is, $\mathcal{X} = \mathbb{R}^d$ and we assume that each $x$ is 
generated as follows. First, we choose a random number in $\{1,\ldots,k\}$. Let $Y$ be a 
random variable corresponding to this choice, and denote $P[Y = y] = c_y$. Second, 
we choose $x$ on the basis of the value of $Y$ according to a Gaussian distribution 

$$P[X = x \mid Y = y] = \frac{1}{(2\pi)^{d/2}|\Sigma_y|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_y)^\top\Sigma_y^{-1}(x-\mu_y)\right). \tag{24.9}$$

Therefore, the density of $X$ can be written as: 

$$\begin{aligned}
P[X = x] &= \sum_{y=1}^k P[Y = y]\,P[X = x \mid Y = y] \\
&= \sum_{y=1}^k c_y\,\frac{1}{(2\pi)^{d/2}|\Sigma_y|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_y)^\top\Sigma_y^{-1}(x-\mu_y)\right).
\end{aligned}$$

Note that $Y$ is a hidden variable that we do not observe in our data. Nevertheless, 
we introduce $Y$ since it helps us describe a simple parametric form of the probability 
of $X$. 

More generally, let $\theta$ be the parameters of the joint distribution of $X$ and $Y$ (e.g., 
in the preceding example, $\theta$ consists of $c_y$, $\mu_y$, and $\Sigma_y$, for all $y = 1,\ldots,k$). Then, the 
log-likelihood of an observation $x$ can be written as 

$$\log\left(P_\theta[X = x]\right) = \log\left(\sum_{y=1}^k P_\theta[X = x, Y = y]\right).$$

Given an i.i.d. sample, $S = (x_1,\ldots,x_m)$, we would like to find $\theta$ that maximizes 
the log-likelihood of $S$, 

$$\begin{aligned}
L(\theta) &= \log\prod_{i=1}^m P_\theta[X = x_i] \\
&= \sum_{i=1}^m \log P_\theta[X = x_i] \\
&= \sum_{i=1}^m \log\left(\sum_{y=1}^k P_\theta[X = x_i, Y = y]\right).
\end{aligned}$$

The maximum-likelihood estimator is therefore the solution of the maximization 
problem 

$$\operatorname*{argmax}_{\theta} L(\theta) = \operatorname*{argmax}_{\theta}\sum_{i=1}^m \log\left(\sum_{y=1}^k P_\theta[X = x_i, Y = y]\right).$$

In many situations, the summation inside the log makes the preceding optimization 
problem computationally hard. The Expectation-Maximization (EM) algorithm, 
due to Dempster, Laird, and Rubin, is an iterative procedure for searching a 
(local) maximum of $L(\theta)$. While EM is not guaranteed to find the global maximum, 
it often works reasonably well in practice. 

EM is designed for those cases in which, had we known the values of the latent 
variables $Y$, then the maximum likelihood optimization problem would have been 
tractable. More precisely, define the following function over $m \times k$ matrices and the 
set of parameters $\theta$: 

$$F(Q,\theta) = \sum_{i=1}^m\sum_{y=1}^k Q_{i,y}\log\left(P_\theta[X = x_i, Y = y]\right).$$

If each row of $Q$ defines a probability over the $i$th latent variable given $X = x_i$, 
then we can interpret $F(Q,\theta)$ as the expected log-likelihood of a training set 
$(x_1,y_1),\ldots,(x_m,y_m)$, where the expectation is with respect to the choice of each $y_i$ 
on the basis of the $i$th row of $Q$. In the definition of $F$, the summation is outside 
the log, and we assume that this makes the optimization problem with respect to $\theta$ 
tractable: 

Assumption 24.1. For any matrix $Q \in [0,1]^{m,k}$, such that each row of $Q$ sums to 1, 
the optimization problem 

$$\operatorname*{argmax}_{\theta} F(Q,\theta)$$

is tractable. 


The intuitive idea of EM is that we have a "chicken and egg" problem. On one 
hand, had we known $Q$, then by our assumption, the optimization problem of finding 
the best $\theta$ is tractable. On the other hand, had we known the parameters $\theta$ we could 
have set $Q_{i,y}$ to be the probability of $Y = y$ given that $X = x_i$. The EM algorithm 
therefore alternates between finding $\theta$ given $Q$ and finding $Q$ given $\theta$. Formally, 
EM finds a sequence of solutions $(Q^{(1)},\theta^{(1)}), (Q^{(2)},\theta^{(2)}), \ldots$ where at iteration $t$, we 
construct $(Q^{(t+1)},\theta^{(t+1)})$ by performing two steps. 

■ Expectation Step: Set 

$$Q^{(t+1)}_{i,y} = P_{\theta^{(t)}}[Y = y \mid X = x_i]. \tag{24.10}$$

This step is called the Expectation step, because it yields a new probability over 
the latent variables, which defines a new expected log-likelihood function over $\theta$. 

■ Maximization Step: Set $\theta^{(t+1)}$ to be the maximizer of the expected log-likelihood, 
where the expectation is according to $Q^{(t+1)}$: 

$$\theta^{(t+1)} = \operatorname*{argmax}_{\theta} F(Q^{(t+1)},\theta). \tag{24.11}$$

By our assumption, it is possible to solve this optimization problem efficiently. 

The initial values of $\theta^{(1)}$ and $Q^{(1)}$ are usually chosen at random and the 
procedure terminates after the improvement in the likelihood value stops being 
significant. 


24.4.1 EM as an Alternate Maximization Algorithm 


To analyze the EM algorithm, we first view it as an alternate maximization 
algorithm. Define the following objective function 

$$G(Q,\theta) = F(Q,\theta) - \sum_{i=1}^m\sum_{y=1}^k Q_{i,y}\log(Q_{i,y}).$$

The second term is the sum of the entropies of the rows of $Q$. Let 

$$\mathcal{Q} = \left\{Q \in [0,1]^{m,k} : \forall i,\ \sum_{y=1}^k Q_{i,y} = 1\right\}$$

be the set of matrices whose rows define probabilities over $[k]$. The following lemma 
shows that EM performs alternate maximization iterations for maximizing $G$. 

Lemma 24.2. The EM procedure can be rewritten as 

$$Q^{(t+1)} = \operatorname*{argmax}_{Q\in\mathcal{Q}} G(Q,\theta^{(t)})$$

$$\theta^{(t+1)} = \operatorname*{argmax}_{\theta} G(Q^{(t+1)},\theta).$$

Furthermore, $G(Q^{(t+1)},\theta^{(t)}) = L(\theta^{(t)})$. 


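Proof. Given $Q^{(t+1)}$ we clearly have that 

$$\operatorname*{argmax}_{\theta} G(Q^{(t+1)},\theta) = \operatorname*{argmax}_{\theta} F(Q^{(t+1)},\theta).$$

Therefore, we only need to show that for any $\theta$, the solution of $\operatorname*{argmax}_{Q\in\mathcal{Q}} G(Q,\theta)$ 
is to set $Q_{i,y} = P_\theta[Y = y \mid X = x_i]$. Indeed, by Jensen's inequality, for any $Q \in \mathcal{Q}$ we 
have that 

$$\begin{aligned}
G(Q,\theta) &= \sum_{i=1}^m\left(\sum_{y=1}^k Q_{i,y}\log\left(\frac{P_\theta[X = x_i, Y = y]}{Q_{i,y}}\right)\right) \\
&\le \sum_{i=1}^m \log\left(\sum_{y=1}^k Q_{i,y}\,\frac{P_\theta[X = x_i, Y = y]}{Q_{i,y}}\right) \\
&= \sum_{i=1}^m \log\left(\sum_{y=1}^k P_\theta[X = x_i, Y = y]\right) \\
&= \sum_{i=1}^m \log\left(P_\theta[X = x_i]\right) = L(\theta),
\end{aligned}$$

while for $Q_{i,y} = P_\theta[Y = y \mid X = x_i]$ we have 

$$\begin{aligned}
G(Q,\theta) &= \sum_{i=1}^m\left(\sum_{y=1}^k P_\theta[Y = y \mid X = x_i]\log\left(\frac{P_\theta[X = x_i, Y = y]}{P_\theta[Y = y \mid X = x_i]}\right)\right) \\
&= \sum_{i=1}^m\sum_{y=1}^k P_\theta[Y = y \mid X = x_i]\log\left(P_\theta[X = x_i]\right) \\
&= \sum_{i=1}^m \log\left(P_\theta[X = x_i]\right)\sum_{y=1}^k P_\theta[Y = y \mid X = x_i] \\
&= \sum_{i=1}^m \log\left(P_\theta[X = x_i]\right) = L(\theta).
\end{aligned}$$

This shows that setting $Q_{i,y} = P_\theta[Y = y \mid X = x_i]$ maximizes $G(Q,\theta)$ over $Q \in \mathcal{Q}$ and 
shows that $G(Q^{(t+1)},\theta^{(t)}) = L(\theta^{(t)})$. □ 

The preceding lemma immediately implies: 

Theorem 24.3. The EM procedure never decreases the log-likelihood; namely, 
for all $t$, 

$$L(\theta^{(t+1)}) \ge L(\theta^{(t)}).$$

Proof. By the lemma we have 

$$L(\theta^{(t+1)}) = G(Q^{(t+2)},\theta^{(t+1)}) \ge G(Q^{(t+1)},\theta^{(t+1)}) \ge G(Q^{(t+1)},\theta^{(t)}) = L(\theta^{(t)}).$$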
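
The two facts established in the proof of Lemma 24.2 (G never exceeds the log-likelihood, with equality at the posterior) can be checked on random numbers. In this sketch the matrix `joint` is a hypothetical stand-in for the joint probabilities of each observation and each latent value under a fixed parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 5, 3

# Hypothetical joint probabilities: joint[i, y] stands in for P_theta[X = x_i, Y = y].
joint = rng.uniform(0.01, 0.2, size=(m, k))
L = np.sum(np.log(joint.sum(axis=1)))            # log-likelihood L(theta)

def G(Q, joint):
    """G(Q, theta) = sum over i, y of Q[i, y] * log(joint[i, y] / Q[i, y])."""
    return np.sum(Q * np.log(joint / Q))

# An arbitrary row-stochastic Q versus the posterior P_theta[Y = y | X = x_i].
Q_random = rng.dirichlet(np.ones(k), size=m)
Q_posterior = joint / joint.sum(axis=1, keepdims=True)

print(G(Q_random, joint) <= L, np.isclose(G(Q_posterior, joint), L))
```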


24.4.2 EM for Mixture of Gaussians (Soft k-Means) 


Consider the case of a mixture of k Gaussians in which @ is a triplet 
(c, {@y,..., Mg}, {D1,..., Ue}) where Pe[Y = y] =cy and Pe[X =x|Y = y] is as given 
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in Equation (24.9). For simplicity, we assume that ©; = X2 =--- = XL, = J, where 
I is the identity matrix. Specifying the EM algorithm for this case we obtain the 
following: 


H Expectation step: For each i € [m] and y € [k] we have that 


1 
PaolY = ylX =x] = Zz VOY = y] Po [X =xil¥ = y] 


1 1 2 

aa a exp (-5Im = w|| ) : (24.12) 
where Z; is a normalization factor which ensures that dy PawlY = y|X =x] 
sums to 1. 

■ Maximization step: We need to set $\theta^{(t+1)}$ to be a maximizer of Equation (24.11),
which in our case amounts to maximizing the following expression w.r.t. $c$ and $\mu$:

$$\sum_{i=1}^m \sum_{y=1}^k P_{\theta^{(t)}}[Y = y \mid X = x_i] \left( \log(c_y) - \tfrac{1}{2} \|x_i - \mu_y\|^2 \right). \qquad (24.13)$$



Comparing the derivative of Equation (24.13) w.r.t. $\mu_y$ to zero and rearranging
terms we obtain:

$$\mu_y = \frac{\sum_{i=1}^m P_{\theta^{(t)}}[Y = y \mid X = x_i] \, x_i}{\sum_{i=1}^m P_{\theta^{(t)}}[Y = y \mid X = x_i]}.$$

That is, $\mu_y$ is a weighted average of the $x_i$ where the weights are according to the
probabilities calculated in the E step. To find the optimal $c$ we need to be more
careful since we must ensure that $c$ is a probability vector. In Exercise 24.3 we
show that the solution is

$$c_y = \frac{\sum_{i=1}^m P_{\theta^{(t)}}[Y = y \mid X = x_i]}{\sum_{y'=1}^k \sum_{i=1}^m P_{\theta^{(t)}}[Y = y' \mid X = x_i]}. \qquad (24.14)$$

It is interesting to compare the preceding algorithm to the k-means algorithm
described in Chapter 22. In the k-means algorithm, we first assign each example to a
cluster according to the distance $\|x_i - \mu_y\|$. Then, we update each center $\mu_y$ according
to the average of the examples assigned to this cluster. In the EM approach,
however, we determine the probability that each example belongs to each cluster.
Then, we update the centers on the basis of a weighted sum over the entire sample.
For this reason, the EM approach for k-means is sometimes called “soft k-means.”
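
The E and M steps for the identity-covariance mixture can be sketched in a few lines of NumPy. This is a minimal illustration, not the book's implementation: the function name `soft_kmeans_em`, the choice of initial centers (random sample points), and the fixed iteration count are all assumptions made for the example. The function also records the log-likelihood at each round so that the monotonicity stated in Theorem 24.3 can be checked empirically.

```python
import numpy as np

def soft_kmeans_em(X, k, n_iters=50, seed=0):
    """EM for a mixture of k Gaussians with identity covariances (soft k-means)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    # Initialization is an assumption of this sketch: k distinct sample points.
    mu = X[rng.choice(m, size=k, replace=False)].copy()
    c = np.full(k, 1.0 / k)
    log_liks = []
    for _ in range(n_iters):
        # E step: log(c_y) - ||x_i - mu_y||^2 / 2, then normalize each row by Z_i.
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)      # shape (m, k)
        log_w = np.log(c)[None, :] - 0.5 * sq
        log_z = np.logaddexp.reduce(log_w, axis=1, keepdims=True)     # log Z_i
        P = np.exp(log_w - log_z)                                      # rows sum to 1
        # Log-likelihood at the current parameters (Gaussian constant included).
        log_liks.append(float(log_z.sum() - 0.5 * m * d * np.log(2 * np.pi)))
        # M step: weighted means for mu, normalized total weights for c.
        weights = P.sum(axis=0)
        mu = (P.T @ X) / weights[:, None]
        c = weights / m
    return c, mu, log_liks
```

Working in log space for the E step is a design choice of the sketch; it avoids underflow when an example is far from every center.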


24.5 BAYESIAN REASONING 


The maximum likelihood estimator follows a frequentist approach. This means that
we refer to the parameter $\theta$ as a fixed parameter and the only problem is that we do
not know its value. A different approach to parameter estimation is called Bayesian
reasoning. In the Bayesian approach, our uncertainty about $\theta$ is also modeled using
probability theory. That is, we think of $\theta$ as a random variable as well and refer to the
distribution $P[\theta]$ as a prior distribution. As its name indicates, the prior distribution
should be defined by the learner prior to observing the data.
As an example, let us consider again the drug company which developed a new
drug. On the basis of past experience, the statisticians at the drug company believe
that whenever a drug has reached the level of clinical experiments on people, it is
likely to be effective. They model this prior belief by defining a density distribution
on $\theta$ such that

$$P[\theta] = \begin{cases} 0.8 & \text{if } \theta > 0.5 \\ 0.2 & \text{if } \theta \le 0.5 \end{cases} \qquad (24.15)$$

As before, given a specific value of $\theta$, it is assumed that the conditional probability,
$P[X = x \mid \theta]$, is known. In the drug company example, $X$ takes values in $\{0, 1\}$ and
$P[X = x \mid \theta] = \theta^x (1 - \theta)^{1-x}$.

Once the prior distribution over $\theta$ and the conditional distribution over $X$ given
$\theta$ are defined, we again have complete knowledge of the distribution over $X$. This is
because we can write the probability over $X$ as a marginal probability

$$P[X = x] = \sum_\theta P[X = x, \theta] = \sum_\theta P[\theta] \, P[X = x \mid \theta],$$

where the last equality follows from the definition of conditional probability. If $\theta$
is continuous we replace $P[\theta]$ with the density function and the sum becomes an
integral:

$$P[X = x] = \int P[\theta] \, P[X = x \mid \theta] \, d\theta.$$


Seemingly, once we know $P[X = x]$, a training set $S = (x_1, \ldots, x_m)$ tells us nothing
as we are already experts who know the distribution over a new point $X$. However,
the Bayesian view introduces dependency between $S$ and $X$. This is because we
now refer to $\theta$ as a random variable. A new point $X$ and the previous points in $S$ are
independent only conditioned on $\theta$. This is different from the frequentist philosophy
in which $\theta$ is a parameter that we might not know, but since it is just a parameter of
the distribution, a new point $X$ and previous points $S$ are always independent.

In the Bayesian framework, since $X$ and $S$ are not independent anymore, what
we would like to calculate is the probability of $X$ given $S$, which by the chain rule
can be written as follows:

$$P[X = x \mid S] = \sum_\theta P[X = x \mid \theta, S] \, P[\theta \mid S] = \sum_\theta P[X = x \mid \theta] \, P[\theta \mid S].$$

The second equality follows from the assumption that $X$ and $S$ are independent
when we condition on $\theta$. Using the Bayes rule we have

$$P[\theta \mid S] = \frac{P[S \mid \theta] \, P[\theta]}{P[S]},$$

and together with the assumption that points are independent conditioned on $\theta$, we
can write

$$P[\theta \mid S] = \frac{P[S \mid \theta] \, P[\theta]}{P[S]} = \frac{1}{P[S]} \prod_{i=1}^m P[X = x_i \mid \theta] \, P[\theta].$$
We therefore obtain the following expression for Bayesian prediction:

$$P[X = x \mid S] = \frac{1}{P[S]} \sum_\theta P[X = x \mid \theta] \prod_{i=1}^m P[X = x_i \mid \theta] \, P[\theta]. \qquad (24.16)$$

Getting back to our drug company example, we can rewrite $P[X = x \mid S]$ as

$$P[X = x \mid S] = \frac{1}{P[S]} \int \theta^{x + \sum_i x_i} (1 - \theta)^{1 - x + \sum_i (1 - x_i)} \, P[\theta] \, d\theta.$$

It is interesting to note that when $P[\theta]$ is uniform we obtain that

$$P[X = x \mid S] \propto \int \theta^{x + \sum_i x_i} (1 - \theta)^{1 - x + \sum_i (1 - x_i)} \, d\theta.$$

Solving the preceding integral (using integration by parts) we obtain

$$P[X = 1 \mid S] = \frac{\sum_i x_i + 1}{m + 2}.$$

Recall that the prediction according to the maximum likelihood principle in this
case is $P[X = 1 \mid \hat{\theta}] = \frac{\sum_i x_i}{m}$. The Bayesian prediction with uniform prior is rather
similar to the maximum likelihood prediction, except it adds “pseudoexamples” to
the training set, thus biasing the prediction toward the uniform prior. 
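
As a sanity check on the uniform-prior case, the prediction can be computed directly as a ratio of two integrals of the form $\int_0^1 \theta^a (1-\theta)^b \, d\theta = \frac{a!\,b!}{(a+b+1)!}$ and compared with the value $(\sum_i x_i + 1)/(m + 2)$. The helper names `beta_integral` and `bayes_predict_uniform` are hypothetical and exist only for this sketch.

```python
from math import comb

def beta_integral(a, b):
    """Exact integral of theta**a * (1 - theta)**b over [0, 1], for integers a, b >= 0."""
    # a! b! / (a + b + 1)!  ==  1 / ((a + b + 1) * C(a + b, a))
    return 1.0 / ((a + b + 1) * comb(a + b, a))

def bayes_predict_uniform(xs):
    """P[X = 1 | S] under a uniform prior on theta, for a 0/1 sample xs."""
    s, m = sum(xs), len(xs)
    p1 = beta_integral(s + 1, m - s)      # unnormalized weight of x = 1
    p0 = beta_integral(s, m - s + 1)      # unnormalized weight of x = 0
    return p1 / (p0 + p1)

sample = [1, 1, 0, 1]
bayes = bayes_predict_uniform(sample)     # equals (3 + 1) / (4 + 2)
mle = sum(sample) / len(sample)           # maximum likelihood prediction, 3 / 4
```

With an empty sample the Bayesian prediction is 1/2, which is the prior mean; the maximum likelihood prediction is undefined in that case.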


Maximum A Posteriori 

In many situations, it is difficult to find a closed form solution to the integral given
in Equation (24.16). Several numerical methods can be used to approximate this
integral. Another popular solution is to find a single $\theta$ which maximizes $P[\theta \mid S]$.
The value of $\theta$ which maximizes $P[\theta \mid S]$ is called the Maximum A Posteriori estimator.
Once this value is found, we can calculate the probability that $X = x$ given the
maximum a posteriori estimator and independently of $S$.


24.6 SUMMARY 


In the generative approach to machine learning we aim at modeling the distribution 
over the data. In particular, in parametric density estimation we further assume that 
the underlying distribution over the data has a specific parametric form and our goal 
is to estimate the parameters of the model. We have described several principles 
for parameter estimation, including maximum likelihood, Bayesian estimation, and 
maximum a posteriori. We have also described several specific algorithms for imple- 
menting the maximum likelihood under different assumptions on the underlying 
data distribution, in particular, Naive Bayes, LDA, and EM. 


24.7 BIBLIOGRAPHIC REMARKS 


The maximum likelihood principle was studied by Ronald Fisher in the beginning 
of the 20th century. Bayesian statistics follow the Bayes rule, which is named after 
the 18th century English mathematician Thomas Bayes. 
There are many excellent books on the generative and Bayesian approaches
to machine learning. See, for example, (Bishop 2006, Koller & Friedman 2009,
MacKay 2003, Murphy 2012, Barber 2012).


24.8 EXERCISES 


24.1 Prove that the maximum likelihood estimator of the variance of a Gaussian variable 
is biased. 

24.2 Regularization for Maximum Likelihood: Consider the following regularized loss
minimization:

$$\frac{1}{m} \sum_{i=1}^m \log\left(1 / P_\theta[x_i]\right) + \frac{1}{m} \left( \log(1/\theta) + \log(1/(1 - \theta)) \right).$$

■ Show that the preceding objective is equivalent to the usual empirical error
had we added two pseudoexamples to the training set. Conclude that the
regularized maximum likelihood estimator would be

$$\hat{\theta} = \frac{1}{m + 2} \left( 1 + \sum_{i=1}^m x_i \right).$$

■ Derive a high probability bound on $|\hat{\theta} - \theta^\star|$. Hint: Rewrite this as
$|\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta^\star|$ and then use the triangle inequality and Hoeffding inequality.

■ Use this to bound the true risk. Hint: Use the fact that now $\hat{\theta} \ge \frac{1}{m+2}$ to relate
$|\hat{\theta} - \theta^\star|$ to the relative entropy.

24.3 Consider a general optimization problem of the form

$$\max_c \sum_{y=1}^k \nu_y \log(c_y) \quad \text{s.t.} \quad c_y > 0, \; \sum_y c_y = 1,$$

where $\nu \in \mathbb{R}_+^k$ is a vector of nonnegative weights.
■ Verify that the M step of soft k-means involves solving such an optimization
problem.
■ Let $c^\star = \frac{1}{\sum_y \nu_y} \nu$. Show that $c^\star$ is a probability vector.
■ Show that the optimization problem is equivalent to the problem

$$\min_c D_{RE}(c^\star \| c) \quad \text{s.t.} \quad c_y > 0, \; \sum_y c_y = 1.$$

■ Using properties of the relative entropy, conclude that $c^\star$ is the solution to the
optimization problem.
Feature Selection and Generation


In the beginning of the book, we discussed the abstract model of learning, in which 
the prior knowledge utilized by the learner is fully encoded by the choice of the 
hypothesis class. However, there is another modeling choice, which we have so far 
ignored: How do we represent the instance space $\mathcal{X}$? For example, in the papayas
learning problem, we proposed the hypothesis class of rectangles in the smoothness- 
color two dimensional plane. That is, our first modeling choice was to represent a 
papaya as a two dimensional point corresponding to its smoothness and color. Only 
after that did we choose the hypothesis class of rectangles as a class of mappings 
from the plane into the label set. The transformation from the real world object 
“papaya” into the scalar representing its smoothness or its color is called a feature 
function or a feature for short; namely, any measurement of the real world object 
can be regarded as a feature. If $\mathcal{X}$ is a subset of a vector space, each $x \in \mathcal{X}$ is sometimes
referred to as a feature vector. It is important to understand that the way we
encode real world objects as an instance space $\mathcal{X}$ is by itself prior knowledge about
the problem. 

Furthermore, even when we already have an instance space $\mathcal{X}$ which is represented
as a subset of a vector space, we might still want to change it into a different
representation and apply a hypothesis class on top of it. That is, we may define a
hypothesis class on $\mathcal{X}$ by composing some class $\mathcal{H}$ on top of a feature function which
maps $\mathcal{X}$ into some other vector space $\mathcal{X}'$. We have already encountered examples
of such compositions: in Chapter 15 we saw that kernel-based SVM learns a composition
of the class of halfspaces over a feature mapping $\psi$ that maps each original
instance in $\mathcal{X}$ into some Hilbert space. And, indeed, the choice of $\psi$ is another form
of prior knowledge we impose on the problem.

In this chapter we study several methods for constructing a good feature set. We 
start with the problem of feature selection, in which we have a large pool of fea- 
tures and our goal is to select a small number of features that will be used by our 
predictor. Next, we discuss feature manipulations and normalization. These include 
simple transformations that we apply on our original features. Such transforma- 
tions may decrease the sample complexity of our learning algorithm, its bias, or its 
computational complexity. Last, we discuss several approaches for feature learning.
In these methods, we try to automate the process of feature construction. 

We emphasize that while there are some common techniques for feature learn- 
ing one may want to try, the No-Free-Lunch theorem implies that there is no 
ultimate feature learner. Any feature learning algorithm might fail on some prob- 
lem. In other words, the success of each feature learner relies (sometimes implicitly) 
on some form of prior assumption on the data distribution. Furthermore, the rela- 
tive quality of features highly depends on the learning algorithm we are later going 
to apply using these features. This is illustrated in the following example. 


Example 25.1. Consider a regression problem in which $\mathcal{X} = \mathbb{R}^2$, $\mathcal{Y} = \mathbb{R}$, and the
loss function is the squared loss. Suppose that the underlying distribution is such
that an example $(x, y)$ is generated as follows: First, we sample $x_1$ from the uniform
distribution over $[-1, 1]$. Then, we deterministically set $y = x_1^2$. Finally, the second
feature is set to be $x_2 = y + z$, where $z$ is sampled from the uniform distribution over
$[-0.01, 0.01]$. Suppose we would like to choose a single feature. Intuitively, the first
feature should be preferred over the second feature as the target can be perfectly 
predicted based on the first feature alone, while it cannot be perfectly predicted 
based on the second feature. Indeed, choosing the first feature would be the right 
choice if we are later going to apply polynomial regression of degree at least 2. How- 
ever, if the learner is going to be a linear regressor, then we should prefer the second 
feature over the first one, since the optimal linear predictor based on the first feature 
will have a larger risk than the optimal linear predictor based on the second feature. 
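
The claim about the linear regressor can be checked numerically. The sketch below samples from the distribution of Example 25.1 and fits a one-feature least-squares line (with intercept) to each feature separately; the sample size, the seed, and the helper name `linear_risk` are assumptions of the illustration. Since $x_1$ is symmetric around zero and $y = x_1^2$, the best line through $x_1$ is the constant $E[y]$, so its risk is close to $\mathrm{Var}(y) = 1/5 - 1/9 = 4/45$.

```python
import numpy as np

def linear_risk(feature, y):
    """Empirical squared loss of the least-squares fit y ~ a * feature + b."""
    A = np.column_stack([feature, np.ones_like(feature)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((A @ coef - y) ** 2))

rng = np.random.default_rng(0)
m = 100_000
x1 = rng.uniform(-1.0, 1.0, size=m)
y = x1 ** 2
x2 = y + rng.uniform(-0.01, 0.01, size=m)

risk_x1 = linear_risk(x1, y)   # close to Var(y) = 4/45
risk_x2 = linear_risk(x2, y)   # tiny, since x2 is y plus small noise
```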


25.1 FEATURE SELECTION


Throughout this section we assume that $\mathcal{X} = \mathbb{R}^d$. That is, each instance is represented
as a vector of $d$ features. Our goal is to learn a predictor that only relies
on $k \ll d$ features. Predictors that use only a small subset of features require a
smaller memory footprint and can be applied faster. Furthermore, in applications 
such as medical diagnostics, obtaining each possible “feature” (e.g., test result) can 
be costly; therefore, a predictor that uses only a small number of features is desirable 
even at the cost of a small degradation in performance, relative to a predictor that 
uses more features. Finally, constraining the hypothesis class to use a small subset 
of features can reduce its estimation error and thus prevent overfitting. 

Ideally, we could have tried all subsets of k out of d features and choose the 
subset which leads to the best performing predictor. However, such an exhaustive 
search is usually computationally intractable. In the following we describe three 
computationally feasible approaches for feature selection. While these methods 
cannot guarantee finding the optimal subset, they often work reasonably well in 
practice. Some of the methods come with formal guarantees on the quality of the 
selected subsets under certain assumptions. We do not discuss these guarantees 
here. 


25.1.1 Filters


Maybe the simplest approach for feature selection is the filter method, in which 
we assess individual features, independently of other features, according to some 
quality measure. We can then select the k features that achieve the highest score
(alternatively, decide also on the number of features to select according to the value 
of their scores). 

Many quality measures for features have been proposed in the literature. Maybe 
the most straightforward approach is to set the score of a feature according to the 
error rate of a predictor that is trained solely by that feature. 

To illustrate this, consider a linear regression problem with the squared loss. Let
$v = (x_{1,j}, \ldots, x_{m,j}) \in \mathbb{R}^m$ be a vector designating the values of the $j$th feature on a
training set of $m$ examples and let $y = (y_1, \ldots, y_m) \in \mathbb{R}^m$ be the values of the target on
the same $m$ examples. The empirical squared loss of an ERM linear predictor that
uses only the $j$th feature would be

$$\min_{a, b \in \mathbb{R}} \frac{1}{m} \|a v + b - y\|^2,$$

where the meaning of adding a scalar $b$ to a vector $v$ is adding $b$ to all coordinates of
$v$. To solve this problem, let $\bar{v} = \frac{1}{m} \sum_{i=1}^m v_i$ be the averaged value of the feature and
let $\bar{y} = \frac{1}{m} \sum_{i=1}^m y_i$ be the averaged value of the target. Clearly (see Exercise 25.1),

$$\min_{a, b \in \mathbb{R}} \frac{1}{m} \|a v + b - y\|^2 = \min_{a, b \in \mathbb{R}} \frac{1}{m} \|a (v - \bar{v}) + b - (y - \bar{y})\|^2. \qquad (25.1)$$

Taking the derivative of the right-hand side objective with respect to $b$ and comparing
it to zero we obtain that $b = 0$. Similarly, solving for $a$ (once we know that
$b = 0$) yields $a = \langle v - \bar{v}, y - \bar{y} \rangle / \|v - \bar{v}\|^2$. Plugging this value back into the objective
we obtain, up to the factor $1/m$ (which does not affect the ranking), the value

$$\|y - \bar{y}\|^2 - \frac{(\langle v - \bar{v}, y - \bar{y} \rangle)^2}{\|v - \bar{v}\|^2}.$$

Ranking the features according to the minimal loss they achieve is equivalent to
ranking them according to the absolute value of the following score (where now a
higher score yields a better feature):

$$\frac{\langle v - \bar{v}, y - \bar{y} \rangle}{\|v - \bar{v}\| \, \|y - \bar{y}\|} = \frac{\frac{1}{m} \langle v - \bar{v}, y - \bar{y} \rangle}{\sqrt{\frac{1}{m} \|v - \bar{v}\|^2} \, \sqrt{\frac{1}{m} \|y - \bar{y}\|^2}}. \qquad (25.2)$$

The preceding expression is known as Pearson’s correlation coefficient. The numerator
is the empirical estimate of the covariance of the $j$th feature and the target
value, $E[(v - E v)(y - E y)]$, while the denominator is the square root of the empirical
estimate of the variance of the $j$th feature, $E[(v - E v)^2]$, times the empirical
estimate of the variance of the target. Pearson’s coefficient ranges from $-1$ to $1$, where
if the Pearson’s coefficient is either $1$ or $-1$, there is a linear mapping from $v$ to $y$
with zero empirical risk.

If Pearson’s coefficient equals zero it means that the optimal linear function from
$v$ to $y$ is the all-zeros function, which means that $v$ alone is useless for predicting $y$.
However, this does not mean that $v$ is a bad feature, as it might be the case that
together with other features $v$ can perfectly predict $y$. Indeed, consider a simple
example in which the target is generated by the function $y = x_1 + 2 x_2$. Assume also
that $x_1$ is generated from the uniform distribution over $\{\pm 1\}$, and $x_2 = -\frac{1}{2} x_1 + \frac{1}{2} z$,
where $z$ is also generated i.i.d. from the uniform distribution over $\{\pm 1\}$. Then,
$E[x_1] = E[x_2] = E[y] = 0$, and we also have

$$E[y x_1] = E[x_1^2] + 2 E[x_2 x_1] = E[x_1^2] - E[x_1^2] + E[z x_1] = 0.$$


Therefore, for a large enough training set, the first feature is likely to have a 
Pearson’s correlation coefficient that is close to zero, and hence it will most proba- 
bly not be selected. However, no function can predict the target value well without 
knowing the first feature. 
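
The failure mode described above is easy to reproduce. The following sketch computes Pearson's coefficient for both features of the example; the sample size, the seed, and the helper name `pearson_score` are assumptions of the illustration. Note that in this construction $y = x_1 + 2 x_2 = z$, so the first feature is uncorrelated with the target even though the target is an exact linear function of the two features together.

```python
import numpy as np

def pearson_score(v, y):
    """Pearson's correlation coefficient between a feature column v and targets y."""
    vc, yc = v - v.mean(), y - y.mean()
    return float(vc @ yc / (np.linalg.norm(vc) * np.linalg.norm(yc)))

rng = np.random.default_rng(0)
m = 50_000
x1 = rng.choice([-1.0, 1.0], size=m)
z = rng.choice([-1.0, 1.0], size=m)
x2 = -0.5 * x1 + 0.5 * z
y = x1 + 2 * x2

score_x1 = pearson_score(x1, y)   # near zero: a filter would discard x1
score_x2 = pearson_score(x2, y)   # clearly positive

# Using both features, least squares recovers y exactly (zero empirical risk).
X = np.column_stack([x1, x2])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
joint_risk = float(np.mean((X @ w - y) ** 2))
```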

There are many other score functions that can be used by a filter method. 
Notable examples are estimators of the mutual information or the area under the 
receiver operating characteristic (ROC) curve. All of these score functions suffer 
from similar problems to the one illustrated previously. We refer the reader to 
Guyon and Elisseeff (2003). 


25.1.2 Greedy Selection Approaches 


Greedy selection is another popular approach for feature selection. Unlike filter 
methods, greedy selection approaches are coupled with the underlying learning 
algorithm. The simplest instance of greedy selection is forward greedy selection. 
We start with an empty set of features, and then we gradually add one feature at a 
time to the set of selected features. Given that our current set of selected features is
$I$, we go over all $i \notin I$, and apply the learning algorithm on the set of features $I \cup \{i\}$.
Each such application yields a different predictor, and we choose to add the feature 
that yields the predictor with the smallest risk (on the training set or on a validation 
set). This process continues until we either select k features, where k is a predefined 
budget of allowed features, or achieve an accurate enough predictor. 


Example 25.2 (Orthogonal Matching Pursuit). To illustrate the forward greedy
selection approach, we specify it to the problem of linear regression with the squared
loss. Let $X \in \mathbb{R}^{m,d}$ be a matrix whose rows are the $m$ training instances. Let $y \in \mathbb{R}^m$
be the vector of the $m$ labels. For every $i \in [d]$, let $X_i$ be the $i$th column of $X$. Given
a set $I \subset [d]$ we denote by $X_I$ the matrix whose columns are $\{X_i : i \in I\}$.

The forward greedy selection method starts with $I_0 = \emptyset$. At iteration $t$, we look
for the feature index $j_t$, which is in

$$\operatorname*{argmin}_j \min_{w \in \mathbb{R}^t} \|X_{I_{t-1} \cup \{j\}} w - y\|^2.$$

Then, we update $I_t = I_{t-1} \cup \{j_t\}$.

We now describe a more efficient implementation of the forward greedy selection
approach for linear regression which is called Orthogonal Matching Pursuit
(OMP). The idea is to keep an orthogonal basis of the features aggregated so far.
Let $V_t$ be a matrix whose columns form an orthonormal basis of the columns of $X_{I_t}$.

Clearly,

$$\min_w \|X_{I_t} w - y\|^2 = \min_{\theta \in \mathbb{R}^t} \|V_t \theta - y\|^2.$$

We will maintain a vector $\theta_t$ which minimizes the right-hand side of the equation.
Initially, we set $I_0 = \emptyset$, $V_0 = \emptyset$, and $\theta_0$ to be the empty vector. At round $t$, for
every $j$, we decompose $X_j = v_j + u_j$ where $v_j = V_{t-1} V_{t-1}^\top X_j$ is the projection of $X_j$
onto the subspace spanned by $V_{t-1}$ and $u_j$ is the part of $X_j$ orthogonal to $V_{t-1}$ (see
Appendix C). Then,

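$$
\begin{aligned}
\min_{\theta, \alpha} \|V_{t-1} \theta + \alpha u_j - y\|^2
&= \min_{\theta, \alpha} \left[ \|V_{t-1} \theta - y\|^2 + \alpha^2 \|u_j\|^2 + 2 \alpha \langle u_j, V_{t-1} \theta - y \rangle \right] \\
&= \min_{\theta, \alpha} \left[ \|V_{t-1} \theta - y\|^2 + \alpha^2 \|u_j\|^2 + 2 \alpha \langle u_j, -y \rangle \right] \\
&= \min_{\theta} \left[ \|V_{t-1} \theta - y\|^2 \right] + \min_{\alpha} \left[ \alpha^2 \|u_j\|^2 - 2 \alpha \langle u_j, y \rangle \right] \\
&= \left[ \|V_{t-1} \theta_{t-1} - y\|^2 \right] + \min_{\alpha} \left[ \alpha^2 \|u_j\|^2 - 2 \alpha \langle u_j, y \rangle \right] \\
&= \|V_{t-1} \theta_{t-1} - y\|^2 - \frac{(\langle u_j, y \rangle)^2}{\|u_j\|^2}.
\end{aligned}
$$

The second equality holds because $u_j$ is orthogonal to the columns of $V_{t-1}$. It follows
that we should select the feature

$$j_t = \operatorname*{argmax}_j \frac{(\langle u_j, y \rangle)^2}{\|u_j\|^2}.$$

The rest of the update is to set

$$V_t = \left[ V_{t-1}, \; \frac{u_{j_t}}{\|u_{j_t}\|} \right], \qquad \theta_t = \left[ \theta_{t-1} \, ; \; \frac{\langle u_{j_t}, y \rangle}{\|u_{j_t}\|} \right],$$

so that the new column of $V_t$ has unit norm and $V_t \theta_t = V_{t-1} \theta_{t-1} + \alpha u_{j_t}$ with the
minimizing value $\alpha = \langle u_{j_t}, y \rangle / \|u_{j_t}\|^2$.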


The OMP procedure maintains an orthonormal basis of the selected features, 
where in the preceding description, the orthonormalization property is obtained by 
a procedure similar to Gram-Schmidt orthonormalization. In practice, the Gram- 
Schmidt procedure is often numerically unstable. In the pseudocode that follows we 
use SVD (see Section C.4) at the end of each round to obtain an orthonormal basis 
in a numerically stable manner. 


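Orthogonal Matching Pursuit (OMP)

input:
  data matrix $X \in \mathbb{R}^{m,d}$, labels vector $y \in \mathbb{R}^m$,
  budget of features $T$
initialize: $I_1 = \emptyset$
for $t = 1, \ldots, T$
  use SVD to find an orthonormal basis $V \in \mathbb{R}^{m,t-1}$ of $X_{I_t}$
    (for $t = 1$ set $V$ to be the all zeros matrix)
  foreach $j \in [d] \setminus I_t$ let $u_j = X_j - V V^\top X_j$
  let $j_t = \operatorname*{argmax}_{j \notin I_t : \|u_j\| > 0} \frac{(\langle u_j, y \rangle)^2}{\|u_j\|^2}$
  update $I_{t+1} = I_t \cup \{j_t\}$
output $I_{T+1}$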
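
The OMP pseudocode translates almost line by line into NumPy. The version below is a sketch under a few assumptions that are not part of the pseudocode: the function name `omp`, the numerical tolerance `tol` used to decide whether a residual column is nonzero, and the early exit when no admissible feature remains. It returns the selected indices in the order they were chosen.

```python
import numpy as np

def omp(X, y, T, tol=1e-10):
    """Orthogonal Matching Pursuit: greedily select up to T columns of X."""
    m, d = X.shape
    selected = []
    for _ in range(T):
        if selected:
            # Orthonormal basis of the span of the selected columns, via SVD.
            U, s, _ = np.linalg.svd(X[:, selected], full_matrices=False)
            V = U[:, s > tol * s.max()]
        else:
            V = np.zeros((m, 0))                 # empty basis in the first round
        # u_j = X_j - V V^T X_j for all columns at once.
        R = X - V @ (V.T @ X)
        norms2 = np.sum(R ** 2, axis=0)
        ok = norms2 > tol                        # only columns with ||u_j|| > 0
        ok[np.array(selected, dtype=int)] = False
        if not ok.any():
            break
        scores = np.full(d, -np.inf)
        scores[ok] = (R[:, ok].T @ y) ** 2 / norms2[ok]
        selected.append(int(np.argmax(scores)))
    return selected
```

For example, if `y` is an exact linear combination of two columns of a random Gaussian matrix, calling `omp(X, y, 2)` typically returns exactly those two column indices.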
More Efficient Greedy Selection Criteria
Let $R(w)$ be the empirical risk of a vector $w$. At each round of the forward greedy
selection method, and for every possible $j$, we should minimize $R(w)$ over the
vectors $w$ whose support is $I_{t-1} \cup \{j\}$. This might be time consuming.

A simpler approach is to choose $j_t$ to be

$$\operatorname*{argmin}_j \min_{\eta \in \mathbb{R}} R(w_{t-1} + \eta e_j),$$

where $e_j$ is the all zeros vector except 1 in the $j$th element. That is, we keep the
weights of the previously chosen coordinates intact and only optimize over the new
variable. Therefore, for each $j$ we need to solve an optimization problem over a
single variable, which is a much easier task than optimizing over $t$ variables.

An even simpler approach is to upper bound $R(w)$ using a “simple” function and
then choose the feature which leads to the largest decrease in this upper bound. For
example, if $R$ is a $\beta$-smooth function (see Equation (12.5) in Chapter 12), then

$$R(w + \eta e_j) \le R(w) + \eta \frac{\partial R(w)}{\partial w_j} + \beta \eta^2 / 2.$$

Minimizing the right-hand side over $\eta$ yields $\eta = -\frac{\partial R(w)}{\partial w_j} \cdot \frac{1}{\beta}$ and plugging this value
into the inequality yields

$$R(w + \eta e_j) \le R(w) - \frac{1}{2\beta} \left( \frac{\partial R(w)}{\partial w_j} \right)^2.$$

This value is minimized if the partial derivative of $R(w)$ with respect to $w_j$ is maximal
in absolute value. We can therefore choose $j_t$ to be the index of the largest (in
absolute value) coordinate of the
gradient of R(w) at w. 
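
For a concrete instance of the gradient-based rule, take $R(w) = \frac{1}{2m} \|Xw - y\|^2$. Along coordinate $j$ its second derivative is $\|X_j\|^2 / m$, so $\beta = \max_j \|X_j\|^2 / m$ is a valid smoothness constant for single-coordinate moves. The sketch below performs one round; the function name and the specific choice of $R$ are assumptions made for illustration.

```python
import numpy as np

def greedy_gradient_step(X, y, w):
    """Pick the coordinate with the largest |dR/dw_j| and take the step eta = -g_j / beta."""
    m = X.shape[0]
    grad = X.T @ (X @ w - y) / m                   # gradient of R(w) = ||Xw - y||^2 / (2m)
    beta = np.max(np.sum(X ** 2, axis=0)) / m      # smoothness bound along any coordinate
    j = int(np.argmax(np.abs(grad)))
    w_new = w.copy()
    w_new[j] -= grad[j] / beta
    guaranteed_decrease = grad[j] ** 2 / (2 * beta)
    return j, w_new, guaranteed_decrease

def risk(X, y, w):
    return float(np.sum((X @ w - y) ** 2) / (2 * X.shape[0]))
```

The returned `guaranteed_decrease` is the quantity $\frac{1}{2\beta} (\partial R / \partial w_j)^2$ from the bound; the actual decrease in `risk` is at least this large.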


Remark 25.3 (AdaBoost as a Forward Greedy Selection Procedure). It is possible
to interpret the AdaBoost algorithm from Chapter 10 as a forward greedy selection
procedure with respect to the function

$$R(w) = \log\left( \sum_{i=1}^m \exp\left( -y_i \sum_{j=1}^d w_j h_j(x_i) \right) \right). \qquad (25.3)$$

See Exercise 25.3.


Backward Elimination 
Another popular greedy selection approach is backward elimination. Here, we start 
with the full set of features, and then we gradually remove one feature at a time 
from the set of features. Given that our current set of selected features is $I$, we go
over all $i \in I$, and apply the learning algorithm on the set of features $I \setminus \{i\}$. Each
such application yields a different predictor, and we choose to remove the feature
$i$ for which the predictor obtained from $I \setminus \{i\}$ has the smallest risk (on the training
set or on a validation set). 

Naturally, there are many possible variants of the backward elimination idea. It 
is also possible to combine forward and backward greedy steps. 
25.1.3 Sparsity-Inducing Norms


The problem of minimizing the empirical risk subject to a budget of k features can 
be written as 


$$\min_w L_S(w) \quad \text{s.t.} \quad \|w\|_0 \le k,$$

where¹

$$\|w\|_0 = |\{i : w_i \ne 0\}|.$$


In other words, we want w to be sparse, which implies that we only need to measure 
the features corresponding to nonzero elements of w. 

Solving this optimization problem is computationally hard (Natarajan 1995,
Davis, Mallat & Avellaneda 1997). A possible relaxation is to replace the nonconvex
function $\|w\|_0$ with the $\ell_1$ norm, $\|w\|_1 = \sum_{i=1}^d |w_i|$, and to solve the problem

$$\min_w L_S(w) \quad \text{s.t.} \quad \|w\|_1 \le k_1, \qquad (25.4)$$

where $k_1$ is a parameter. Since the $\ell_1$ norm is a convex function, this problem can
be solved efficiently as long as the loss function is convex. A related problem is
minimizing the sum of $L_S(w)$ plus an $\ell_1$ norm regularization term,

$$\min_w \left( L_S(w) + \lambda \|w\|_1 \right), \qquad (25.5)$$

where $\lambda$ is a regularization parameter. Since for any $k_1$ there exists a $\lambda$ such that
Equation (25.4) and Equation (25.5) lead to the same solution, the two problems
are in some sense equivalent.

The $\ell_1$ regularization often induces sparse solutions. To illustrate this, let us start
with the simple optimization problem

$$\min_{w \in \mathbb{R}} \left( \tfrac{1}{2} w^2 - x w + \lambda |w| \right). \qquad (25.6)$$

It is easy to verify (see Exercise 25.2) that the solution to this problem is the “soft
thresholding” operator

$$w = \operatorname{sign}(x) \, [\,|x| - \lambda\,]_+, \qquad (25.7)$$

where $[a]_+ = \max\{a, 0\}$. That is, as long as the absolute value of $x$ is smaller than $\lambda$,
the optimal solution will be zero. 
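
The soft thresholding formula can be checked against a brute-force minimization of the objective in Equation (25.6). The grid search below is only an independent check, and both function names are hypothetical.

```python
import numpy as np

def soft_threshold(x, lam):
    """sign(x) * max(|x| - lam, 0): the claimed minimizer of 0.5*w**2 - x*w + lam*|w|."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def brute_force_minimizer(x, lam, grid=np.linspace(-5.0, 5.0, 200_001)):
    """Minimize the same objective over a fine grid of candidate values of w."""
    obj = 0.5 * grid ** 2 - x * grid + lam * np.abs(grid)
    return float(grid[np.argmin(obj)])

lam = 0.5
checks = [(x, float(soft_threshold(x, lam)), brute_force_minimizer(x, lam))
          for x in (-2.0, -0.3, 0.0, 0.3, 2.0)]
```

For the two inputs with $|x| < \lambda$ both methods return zero (up to the grid resolution), which is the sparsity effect described above.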
Next, consider a one dimensional regression problem with respect to the squared
loss:

$$\operatorname*{argmin}_{w \in \mathbb{R}} \left( \frac{1}{2m} \sum_{i=1}^m (x_i w - y_i)^2 + \lambda |w| \right).$$

We can rewrite the problem as

$$\operatorname*{argmin}_{w \in \mathbb{R}} \left( \frac{1}{2} \left( \frac{1}{m} \sum_{i} x_i^2 \right) w^2 - \left( \frac{1}{m} \sum_{i=1}^m x_i y_i \right) w + \lambda |w| \right).$$

For simplicity let us assume that $\frac{1}{m} \sum_i x_i^2 = 1$, and denote $\langle x, y \rangle = \sum_{i=1}^m x_i y_i$; then the
optimal solution is

$$w = \operatorname{sign}(\langle x, y \rangle) \, [\,|\langle x, y \rangle| / m - \lambda\,]_+.$$

That is, the solution will be zero unless the correlation between the feature $x$ and
the labels vector $y$ is larger than $\lambda$.

¹ The function $\|\cdot\|_0$ is often referred to as the $\ell_0$ norm. Despite the use of the “norm” notation, $\|\cdot\|_0$
is not really a norm; for example, it does not satisfy the positive homogeneity property of norms,
$\|a w\|_0 \ne |a| \, \|w\|_0$.


Remark 25.4. Unlike the $\ell_1$ norm, the $\ell_2$ norm does not induce sparse solutions.
Indeed, consider the aforementioned problem with an $\ell_2$ regularization, namely,

$$\operatorname*{argmin}_{w \in \mathbb{R}} \left( \frac{1}{2m} \sum_{i=1}^m (x_i w - y_i)^2 + \lambda w^2 \right).$$

Then, the optimal solution is

$$w = \frac{\langle x, y \rangle / m}{\|x\|^2 / m + 2 \lambda}.$$

This solution will be nonzero even if the correlation between $x$ and $y$ is very small. In
contrast, as we have shown before, when using $\ell_1$ regularization, $w$ will be nonzero
only if the correlation between $x$ and $y$ is larger than the regularization parameter $\lambda$.


Adding $\ell_1$ regularization to a linear regression problem with the squared loss
yields the LASSO algorithm, defined as

$$\operatorname*{argmin}_w \left( \frac{1}{2m} \|X w - y\|^2 + \lambda \|w\|_1 \right). \qquad (25.8)$$

Under some assumptions on the distribution and the regularization parameter $\lambda$,
the LASSO will find sparse solutions (see, for example, (Zhao & Yu 2006) and
the references therein). Another advantage of the $\ell_1$ norm is that a vector with
low $\ell_1$ norm can be “sparsified” (see, for example, (Shalev-Shwartz, Zhang, and
Srebro 2010) and the references therein). 
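
The text does not prescribe a solver for the LASSO objective. One common choice, shown here purely as an illustrative sketch, is cyclic coordinate descent: with all other coordinates fixed, the objective in $w_j$ is a one-dimensional problem of the form (25.6) up to scaling, so each update is a soft thresholding step. The function name `lasso_cd` and the fixed number of sweeps are assumptions.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/(2m)) * ||Xw - y||^2 + lam * ||w||_1."""
    m, d = X.shape
    w = np.zeros(d)
    col_sq = np.sum(X ** 2, axis=0) / m        # (1/m) * ||X_j||^2 for every column
    r = y - X @ w                              # residual for the current w
    for _ in range(n_sweeps):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue                       # an all-zero column never helps
            r += X[:, j] * w[j]                # residual with coordinate j removed
            rho = X[:, j] @ r / m              # (1/m) * <X_j, partial residual>
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * w[j]                # restore the full residual
    return w
```

On data generated from a sparse weight vector with little noise, the returned vector is typically zero exactly on the irrelevant coordinates, with the relevant coefficients shrunk toward zero by roughly $\lambda$ divided by the column scale.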


25.2 FEATURE MANIPULATION AND NORMALIZATION 


Feature manipulations or normalization include simple transformations that we 
apply on each of our original features. Such transformations may decrease the 
approximation or estimation errors of our hypothesis class or can yield a faster algo- 
rithm. Similarly to the problem of feature selection, here again there are no absolute 
“good” and “bad” transformations, but rather each transformation that we apply 
should be related to the learning algorithm we are going to apply on the resulting 
feature vector as well as to our prior assumptions on the problem. 

To motivate normalization, consider a linear regression problem with the
squared loss. Let $X \in \mathbb{R}^{m,d}$ be a matrix whose rows are the instance vectors and let
$y \in \mathbb{R}^m$ be a vector of target values. Recall that ridge regression returns the vector

$$\operatorname*{argmin}_w \left[ \frac{1}{m} \|X w - y\|^2 + \lambda \|w\|^2 \right] = (\lambda m I + X^\top X)^{-1} X^\top y.$$

Suppose that $d = 2$ and the underlying data distribution is as follows. First we sample
$y$ uniformly at random from $\{\pm 1\}$. Then, we set $x_1$ to be $y + 0.5 \alpha$, where $\alpha$ is sampled
uniformly at random from $\{\pm 1\}$, and we set $x_2$ to be $0.0001 y$. Note that the optimal
weight vector is $w^\star = [0; 10000]$, and $L_{\mathcal{D}}(w^\star) = 0$. However, the objective of ridge
regression at $w^\star$ is $\lambda 10^8$. In contrast, the objective of ridge regression at $w = [1; 0]$ is
likely to be close to $0.25 + \lambda$. It follows that whenever $\lambda > \frac{0.25}{10^8 - 1} \approx 0.25 \times 10^{-8}$, the
objective of ridge regression is smaller at the suboptimal solution $w = [1; 0]$. Since
$\lambda$ typically should be at least $1/m$ (see the analysis in Chapter 13), it follows that in
the aforementioned example, if the number of examples is smaller than $10^8$ then we
are likely to output a suboptimal solution.

The crux of the preceding example is that the two features have completely different
scales. Feature normalization can overcome this problem. There are many
ways to perform feature normalization, and one of the simplest approaches is simply
to make sure that each feature receives values between $-1$ and $1$. In the preceding
example, if we divide each feature by the maximal value it attains we will obtain that
$x_1 = \frac{y + 0.5 \alpha}{1.5}$ and $x_2 = y$. Then, for $\lambda \le 10^{-3}$ the solution of ridge regression is quite
close to w*. 
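
The two-feature example can be simulated directly. The sketch below evaluates the ridge objective $\frac{1}{m}\|Xw - y\|^2 + \lambda \|w\|^2$ at the two candidate vectors discussed above, and then solves ridge regression after dividing each feature by its maximal absolute value. The solver sets the gradient of that objective to zero, which gives the linear system $(X^\top X + \lambda m I) w = X^\top y$; the sample size, seed, and function names are assumptions of the illustration.

```python
import numpy as np

def ridge_objective(X, y, w, lam):
    """(1/m) * ||Xw - y||^2 + lam * ||w||^2."""
    return float(np.mean((X @ w - y) ** 2) + lam * (w @ w))

def ridge_solve(X, y, lam):
    """Minimizer of ridge_objective, from the zero-gradient condition."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X + lam * m * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
m, lam = 1000, 1e-3
y = rng.choice([-1.0, 1.0], size=m)
alpha = rng.choice([-1.0, 1.0], size=m)
X = np.column_stack([y + 0.5 * alpha, 1e-4 * y])

obj_at_optimal = ridge_objective(X, y, np.array([0.0, 1e4]), lam)     # about lam * 1e8
obj_at_suboptimal = ridge_objective(X, y, np.array([1.0, 0.0]), lam)  # 0.25 + lam

X_scaled = X / np.abs(X).max(axis=0)       # each feature now lies in [-1, 1]
w_scaled = ridge_solve(X_scaled, y, lam)   # close to [0, 1]
```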

Moreover, the generalization bounds we have derived in Chapter 13 for regularized
loss minimization depend on the norm of the optimal vector $w^\star$ and on the
maximal norm of the instance vectors.² Therefore, in the aforementioned example,
before we normalize the features we have that $\|w^\star\|^2 = 10^8$, while after we normalize
the features we have that $\|w^\star\|^2 = 1$. The maximal norm of the instance vector
remains roughly the same; hence the normalization greatly improves the estimation
error.

Feature normalization can also improve the runtime of the learning algorithm. 
For example, in Section 14.5.3 we have shown how to use the Stochastic Gradient 
Descent (SGD) optimization algorithm for solving the regularized loss minimiza- 
tion problem. The number of iterations required by SGD to converge also depends 
on the norm of w* and on the maximal norm of ||x||. Therefore, as before, using 
normalization can greatly decrease the runtime of SGD. 

² More precisely, the bounds we derived in Chapter 13 for regularized loss minimization depend on
$\|w^\star\|^2$ and on either the Lipschitzness or the smoothness of the loss function. For linear predictors
and loss functions of the form $\ell(w, (x, y)) = \phi(\langle w, x \rangle, y)$, where $\phi$ is convex and either 1-Lipschitz or
1-smooth with respect to its first argument, we have that $\ell$ is either $\|x\|$-Lipschitz or $\|x\|^2$-smooth. For
example, for the squared loss, $\phi(a, y) = \frac{1}{2}(a - y)^2$, and $\ell(w, (x, y)) = \frac{1}{2}(\langle w, x \rangle - y)^2$ is $\|x\|^2$-smooth with
respect to its first argument.

Next, we demonstrate how a simple transformation on features,
such as clipping, can sometimes decrease the approximation error of our hypothesis
class. Consider again linear regression with the squared loss. Let $a > 1$ be a large
number, suppose that the target $y$ is chosen uniformly at random from $\{\pm 1\}$, and
then the single feature $x$ is set to be $y$ with probability $(1 - 1/a)$ and set to be $a y$
with probability $1/a$. That is, most of the time our feature is bounded but with a very
small probability it gets a very high value. Then, for any $w$, the expected squared
loss of $w$ is

$$L_{\mathcal{D}}(w) = E \, \tfrac{1}{2} (w x - y)^2 = \left( 1 - \frac{1}{a} \right) \frac{1}{2} (w y - y)^2 + \frac{1}{a} \, \frac{1}{2} (a w y - y)^2.$$

Solving for $w$ we obtain that $w^\star = \frac{2a - 1}{a^2 + a - 1}$, which goes to zero as $a$ goes to infinity.
Therefore, the objective at $w^\star$ goes to 0.5 as $a$ goes to infinity. For example, for
$a = 100$ we will obtain $L_{\mathcal{D}}(w^\star) \ge 0.48$. Next, suppose we apply a “clipping” transformation;
that is, we use the transformation $x \mapsto \operatorname{sign}(x) \min\{1, |x|\}$. Then, following
this transformation, $w^\star$ becomes 1 and $L_{\mathcal{D}}(w^\star) = 0$. This simple example shows that
a simple transformation can have a significant influence on the approximation error.
Of course, it is not hard to think of examples in which the same feature trans- 
formation actually hurts performance and increases the approximation error. This 
is not surprising, as we have already argued that feature transformations should rely 
on our prior assumptions on the problem. In the aforementioned example, a prior 
assumption that may lead us to use the “clipping” transformation is that features 
that get values larger than a predefined threshold value give us no additional useful 
information, and therefore we can clip them to the predefined threshold. 
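
The numbers in the clipping example can be checked with a few lines of Python. This is a
minimal sketch rather than part of the original derivation: the function names are
illustrative, and the closed forms use the fact that y² = 1 for y in {±1}.

```python
def expected_loss(w, a):
    """L_D(w) when x = y w.p. 1 - 1/a and x = a*y w.p. 1/a, with y in {-1, +1}.

    Because y**2 == 1, (w*y - y)**2 == (w - 1)**2 and (a*w*y - y)**2 == (a*w - 1)**2.
    """
    return (1 - 1 / a) * 0.5 * (w - 1) ** 2 + (1 / a) * 0.5 * (a * w - 1) ** 2


def best_w(a):
    """Minimizer of expected_loss(., a), found by setting the derivative to zero."""
    return (2 * a - 1) / (a ** 2 + a - 1)


def expected_loss_clipped(w):
    """After x -> sign(x) * min(1, |x|) the feature always equals y."""
    return 0.5 * (w - 1) ** 2


a = 100
w_star = best_w(a)
print(w_star, expected_loss(w_star, a))  # the loss is slightly above 0.48
print(expected_loss_clipped(1.0))        # 0.0
```

Without clipping, the rare large value of x forces w toward zero, so the predictor is
poor on the common case; after clipping, w = 1 fits every example exactly.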


25.2.1 Examples of Feature Transformations 


We now list several common techniques for feature transformations. Usually, it is
helpful to combine some of these transformations (e.g., centering + scaling). In the
following, we denote by $f = (f_1,\dots,f_m) \in \mathbb{R}^m$ the value of the feature f over the
m training examples. Also, we denote by $\bar{f} = \frac{1}{m}\sum_{i=1}^m f_i$ the empirical mean of the
feature over all examples.


Centering:
This transformation makes the feature have zero mean, by setting $f_i \leftarrow f_i - \bar{f}$.


Unit Range:
This transformation makes the range of each feature be [0,1]. Formally, let
$f_{\max} = \max_i f_i$ and $f_{\min} = \min_i f_i$. Then, we set
$f_i \leftarrow \frac{f_i - f_{\min}}{f_{\max} - f_{\min}}$. Similarly, we can make the range of each
feature be [−1,1] by the transformation $f_i \leftarrow 2\,\frac{f_i - f_{\min}}{f_{\max} - f_{\min}} - 1$.
Of course, it is easy to make the range [0,b] or [−b,b], where b is a user-specified
parameter.


Standardization:
This transformation makes all features have a zero mean and unit variance. Formally,
let $\nu = \frac{1}{m}\sum_{i=1}^m (f_i - \bar{f})^2$ be the empirical variance of the feature. Then, we
set $f_i \leftarrow \frac{f_i - \bar{f}}{\sqrt{\nu}}$.


Clipping:
This transformation clips high or low values of the feature. For example,
$f_i \leftarrow \mathrm{sign}(f_i)\min\{b, |f_i|\}$, where b is a user-specified parameter.


Sigmoidal Transformation:
As its name indicates, this transformation applies a sigmoid function on the feature.
For example, $f_i \leftarrow \frac{1}{1+\exp(b f_i)}$, where b is a user-specified parameter. This
transformation can be thought of as a “soft” version of clipping: It has a small effect
on values close to zero and behaves similarly to clipping on values far away from
zero.


Logarithmic Transformation: 

The transformation is $f_i \leftarrow \log(b + f_i)$, where b is a user-specified parameter. This is
widely used when the feature is a “counting” feature. For example, suppose that the 
feature represents the number of appearances of a certain word in a text document. 
Then, the difference between zero occurrences of the word and a single occurrence 
is much more important than the difference between 1000 occurrences and 1001 
occurrences. 
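
The transformations listed above can be sketched in plain Python as follows. This is an
illustrative sketch, not a reference implementation: the function names are ours, a
feature is represented as a list of floats (its values over the m training examples),
and degenerate inputs (constant features, nonpositive arguments to the logarithm) are
assumed away.

```python
import math


def center(f):
    """f_i <- f_i - mean(f)."""
    mean = sum(f) / len(f)
    return [v - mean for v in f]


def unit_range(f):
    """f_i <- (f_i - f_min) / (f_max - f_min); assumes f_max > f_min."""
    lo, hi = min(f), max(f)
    return [(v - lo) / (hi - lo) for v in f]


def standardize(f):
    """f_i <- (f_i - mean(f)) / sqrt(empirical variance); assumes variance > 0."""
    mean = sum(f) / len(f)
    var = sum((v - mean) ** 2 for v in f) / len(f)
    return [(v - mean) / math.sqrt(var) for v in f]


def clip(f, b):
    """f_i <- sign(f_i) * min(b, |f_i|)."""
    return [math.copysign(min(b, abs(v)), v) for v in f]


def sigmoid(f, b):
    """f_i <- 1 / (1 + exp(b * f_i)); math.exp overflows for very large b * f_i."""
    return [1 / (1 + math.exp(b * v)) for v in f]


def log_transform(f, b):
    """f_i <- log(b + f_i); assumes b + f_i > 0."""
    return [math.log(b + v) for v in f]


counts = [0.0, 1.0, 1000.0, 1001.0]
print(log_transform(counts, 1.0))  # the gap 0 vs 1 dominates the gap 1000 vs 1001
print(clip([-5.0, 0.5, 7.0], 2.0))
```

In practice the parameters (mean, variance, range) would be computed on the training
set and then reused unchanged when transforming new examples.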


Remark 25.5. In the aforementioned transformations, each feature is transformed 
on the basis of the values it obtains on the training set, independently of other 
features’ values. In some situations we would like to set the parameter of the 
transformation on the basis of other features as well. A notable example is a trans- 
formation in which one applies a scaling to the features so that the empirical average 
of some norm of the instances becomes 1. 


25.3 FEATURE LEARNING 


So far we have discussed feature selection and manipulations. In these cases, we
start with a predefined vector space $\mathbb{R}^d$, representing our features. Then, we select a
subset of features (feature selection) or transform individual features (feature
transformation). In this section we describe feature learning, in which we start with some
instance space, $\mathcal{X}$, and would like to learn a function, $\psi : \mathcal{X} \to \mathbb{R}^d$, which maps
instances in $\mathcal{X}$ into a representation as d-dimensional feature vectors.

The idea of feature learning is to automate the process of finding a good rep- 
resentation of the input space. As mentioned before, the No-Free-Lunch theorem 
tells us that we must incorporate some prior knowledge on the data distribution in 
order to build a good feature representation. In this section we present a few feature 
learning approaches and demonstrate conditions on the underlying data distribution 
in which these methods can be useful. 

Throughout the book we have already seen several useful feature constructions. 
For example, in the context of polynomial regression, we have mapped the orig- 
inal instances into the vector space of all their monomials (see Section 9.2.2 in 
Chapter 9). After performing this mapping, we trained a linear predictor on top 
of the constructed features. Automation of this process would be to learn a
transformation $\psi : \mathcal{X} \to \mathbb{R}^d$, such that the composition of the class of linear predictors on
top of $\psi$ yields a good hypothesis class for the task at hand.

In the following we describe a technique of feature construction called dictionary 
learning. 


25.3.1 Dictionary Learning Using Auto-Encoders 


The motivation of dictionary learning stems from a commonly used representation
of documents as a “bag-of-words”: Given a dictionary of words $D = \{w_1,\dots,w_k\}$,
where each $w_i$ is a string representing a word in the dictionary, and given a
document, $(p_1,\dots,p_d)$, where each $p_i$ is a word in the document, we represent the
document as a vector $x \in \{0,1\}^k$, where $x_i$ is 1 if $w_i = p_j$ for some $j \in [d]$, and
$x_i = 0$ otherwise. It was empirically observed in many text processing tasks that
linear predictors are quite powerful when applied on this representation. Intuitively,
we can think of each word as a feature that measures some aspect of the docu- 
ment. Given labeled examples (e.g., topics of the documents), a learning algorithm 
searches for a linear predictor that weights these features so that a right combination 
of appearances of words is indicative of the label. 

While in text processing there is a natural meaning to words and to the dictio- 
nary, in other applications we do not have such an intuitive representation of an 
instance. For example, consider the computer vision application of object recogni- 
tion. Here, the instance is an image and the goal is to recognize which object appears 
in the image. Applying a linear predictor on the pixel-based representation of the 
image does not yield a good classifier. What we would like to have is a mapping
$\psi$ that would take the pixel-based representation of the image and would output a
bag of “visual words,” representing the content of the image. For example, a “visual 
word” can be “there is an eye in the image.” If we had such representation, we could 
have applied a linear predictor on top of this representation to train a classifier for, 
say, face recognition. Our question is, therefore, How can we learn a dictionary 
of “visual words” such that a bag-of-words representation of an image would be 
helpful for predicting which object appears in the image? 

A first naive approach for dictionary learning relies on a clustering algorithm
(see Chapter 22). Suppose that we learn a function $c : \mathcal{X} \to \{1,\dots,k\}$, where c(x)
is the cluster to which x belongs. Then, we can think of the clusters as “words,”
and of instances as “documents,” where a document x is mapped to the vector
$\psi(x) \in \{0,1\}^k$, where $\psi(x)_i$ is 1 if and only if x belongs to the ith cluster. Now, it
is straightforward to see that applying a linear predictor on $\psi(x)$ is equivalent to
assigning the same target value to all instances that belong to the same cluster.
Furthermore, if the clustering is based on distances from a class center (e.g., k-means),
then a linear predictor on $\psi(x)$ yields a piece-wise constant predictor on x.

Both the k-means and PCA approaches can be regarded as special cases of a
more general approach for dictionary learning which is called auto-encoders. In an
auto-encoder we learn a pair of functions: an “encoder” function, $\psi : \mathbb{R}^d \to \mathbb{R}^k$, and
a “decoder” function, $\phi : \mathbb{R}^k \to \mathbb{R}^d$. The goal of the learning process is to find a pair
of functions such that the reconstruction error, $\sum_i \|x_i - \phi(\psi(x_i))\|^2$, is small. Of
course, we can trivially set k = d and both $\psi$, $\phi$ to be the identity mapping, which
yields a perfect reconstruction. We therefore must restrict $\psi$ and $\phi$ in some way. In
PCA, we constrain k < d and further restrict $\psi$ and $\phi$ to be linear functions. In
k-means, k is not restricted to be smaller than d, but now $\psi$ and $\phi$ rely on k centroids,
$\mu_1,\dots,\mu_k$, and $\psi(x)$ returns an indicator vector in $\{0,1\}^k$ that indicates the closest
centroid to x, while $\phi$ takes as input an indicator vector and returns the centroid
representing this vector.
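
To make the k-means view of an auto-encoder concrete, the following sketch implements
the encoder, the decoder, and the reconstruction error for a fixed set of centroids. It
is illustrative only: the function names and the example centroids are assumptions, and
the centroids are taken as given rather than learned.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def encode(x, centroids):
    """psi: indicator vector in {0,1}^k marking the centroid closest to x."""
    j = min(range(len(centroids)), key=lambda i: sq_dist(x, centroids[i]))
    return [1 if i == j else 0 for i in range(len(centroids))]


def decode(v, centroids):
    """phi: return the centroid selected by the indicator vector v."""
    return centroids[v.index(1)]


def reconstruction_error(xs, centroids):
    """Sum over the sample of ||x - phi(psi(x))||^2."""
    return sum(sq_dist(x, decode(encode(x, centroids), centroids)) for x in xs)


centroids = [(0.0, 0.0), (10.0, 10.0)]          # hypothetical centroids
xs = [(1.0, 0.0), (9.0, 10.0), (0.0, -1.0)]
print([encode(x, centroids) for x in xs])
print(reconstruction_error(xs, centroids))
```

Here the encoding has a single nonzero coordinate, so k may exceed d without the
reconstruction becoming trivial.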

An important property of the k-means construction, which is key in allowing
k to be larger than d, is that $\psi$ maps instances into sparse vectors. In fact, in
k-means only a single coordinate of $\psi(x)$ is nonzero. An immediate extension of the
k-means construction is therefore to restrict the range of $\psi$ to be vectors with at
most s nonzero elements, where s is a small integer. In particular, let $\psi$ and $\phi$ be
functions that depend on $\mu_1,\dots,\mu_k$. The function $\psi$ maps an instance vector x to a
vector $\psi(x) \in \mathbb{R}^k$, where $\psi(x)$ should have at most s nonzero elements. The function
$\phi(v)$ is defined to be $\sum_{j=1}^k v_j \mu_j$. As before, our goal is to have a small reconstruction
error, and therefore we can define


\[ \psi(x) = \operatorname*{argmin}_{v} \|x - \phi(v)\|^2 \quad \text{s.t.} \quad \|v\|_0 \le s, \]


where $\|v\|_0 = |\{j : v_j \ne 0\}|$. Note that when s = 1 and we further restrict $\|v\|_1 = 1$
then we obtain the k-means encoding function; that is, $\psi(x)$ is the indicator vector
of the centroid closest to x. For larger values of s, the optimization problem in the
preceding definition of $\psi$ becomes computationally difficult. Therefore, in practice,
we sometimes use $\ell_1$ regularization instead of the sparsity constraint and define $\psi$
to be


\[ \psi(x) = \operatorname*{argmin}_{v} \left[ \|x - \phi(v)\|^2 + \lambda \|v\|_1 \right], \]


where $\lambda > 0$ is a regularization parameter. In either case, the dictionary learning problem
is now to find the vectors $\mu_1,\dots,\mu_k$ such that the reconstruction error,
$\sum_{i=1}^m \|x_i - \phi(\psi(x_i))\|^2$, is as small as possible. Even if $\psi$ is defined using the $\ell_1$ regularization,
this is still a computationally hard problem (similar to the k-means problem). However,
several heuristic search algorithms may give reasonably good solutions. These
algorithms are beyond the scope of this book.
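
For a fixed dictionary, the $\ell_1$-regularized encoding step alone is a convex problem
and can be approximated by standard methods. The sketch below uses proximal gradient
descent (iterative soft thresholding), which is our choice for illustration and not a
method described in the text; it minimizes ||x − φ(v)||² + λ||v||₁ with
φ(v) = Σ_j v_j μ_j. The function names, the step-size rule, and the iteration count are
assumptions, and the dictionary is assumed to contain at least one nonzero entry.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |.|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def decode(v, dictionary):
    """phi(v) = sum_j v_j * mu_j, where dictionary[j] is the vector mu_j."""
    d = len(dictionary[0])
    return [sum(v[j] * dictionary[j][i] for j in range(len(v))) for i in range(d)]


def encode_l1(x, dictionary, lam, n_iters=500):
    """Approximate argmin_v ||x - phi(v)||^2 + lam * ||v||_1 for a fixed dictionary."""
    k = len(dictionary)
    # 2 * (sum of squared entries) upper-bounds the Lipschitz constant of the
    # gradient of the squared term, so 1/L is a safe step size.
    L = 2 * sum(c * c for mu in dictionary for c in mu)
    step = 1.0 / L
    v = [0.0] * k
    for _ in range(n_iters):
        residual = [r - xi for r, xi in zip(decode(v, dictionary), x)]
        grad = [2 * sum(m_i * r_i for m_i, r_i in zip(dictionary[j], residual))
                for j in range(k)]
        v = [soft_threshold(v[j] - step * grad[j], step * lam) for j in range(k)]
    return v


identity = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical orthonormal dictionary
print(encode_l1([3.0, 0.2], identity, lam=2.0))
```

With an orthonormal dictionary the minimizer shrinks each coordinate of x by λ/2, so
small coordinates are set exactly to zero, which is how the $\ell_1$ penalty induces
sparsity. Learning the dictionary itself remains the hard part and is not addressed here.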


25.4 SUMMARY 


Many machine learning algorithms take the feature representation of instances for 
granted. Yet the choice of representation requires careful attention. We discussed 
approaches for feature selection, introducing filters, greedy selection algorithms, 
and sparsity-inducing norms. Next we presented several examples for feature trans- 
formations and demonstrated their usefulness. Last, we discussed feature learn- 
ing, and in particular dictionary learning. We have shown that feature selection, 
manipulation, and learning all depend on some prior knowledge on the data. 


25.5 BIBLIOGRAPHIC REMARKS 


Guyon and Elisseeff (2003) surveyed several feature selection procedures, including 
many types of filters. 

Forward greedy selection procedures for minimizing a convex objective sub- 
ject to a polyhedron constraint date back to the Frank-Wolfe algorithm (Frank & 
Wolfe 1956). The relation to boosting has been studied by several authors, including
Warmuth, Liao & Rätsch (2006), Warmuth, Glocer & Vishwanathan (2008), and
Shalev-Shwartz & Singer (2008). Matching pursuit has been studied in the signal processing
community (Mallat & Zhang 1993). Several papers analyzed greedy selection meth- 
ods under various conditions. See, for example, Shalev-Shwartz, Zhang, and Srebro 
(2010) and the references therein. 

The use of the $\ell_1$-norm as a surrogate for sparsity has a long history
(e.g., Tibshirani (1996) and the references therein), and much work has been done
on understanding the relationship between the $\ell_1$-norm and sparsity. It is also
closely related to compressed sensing (see Chapter 23). The ability to sparsify low
$\ell_1$ norm predictors dates back to Maurey (Pisier 1980-1981). In Section 26.4 we also
show that low $\ell_1$ norm can be used to bound the estimation error of our predictor.

Feature learning and dictionary learning have been extensively studied recently 
in the context of deep neural networks. See, for example, (LeCun & Bengio 1995, 
Hinton et al. 2006, Ranzato et al. 2007, Collobert & Weston 2008, Lee et al. 2009, Le 
et al. 2012, Bengio 2009) and the references therein. 


25.6 EXERCISES 


25.1 Prove the equality given in Equation (25.1). Hint: Let a*, b* be minimizers of the 
left-hand side. Find a,b such that the objective value of the right-hand side is 
smaller than that of the left-hand side. Do the same for the other direction. 

25.2 Show that Equation (25.7) is the solution of Equation (25.6). 

25.3 AdaBoost as a Forward Greedy Selection Algorithm: Recall the AdaBoost
algorithm from Chapter 10. In this section we give another interpretation of AdaBoost
as a forward greedy selection algorithm.

  • Given a set of m instances $x_1,\dots,x_m$, and a hypothesis class $\mathcal{H}$ of finite VC
    dimension, show that there exist d and $h_1,\dots,h_d$ such that for every $h \in \mathcal{H}$
    there exists $i \in [d]$ with $h_i(x_j) = h(x_j)$ for every $j \in [m]$.

  • Let R(w) be as defined in Equation (25.3). Given some w, define $f_w$ to be the
    function

    \[ f_w(\cdot) = \sum_{i=1}^d w_i h_i(\cdot). \]

    Let D be the distribution over [m] defined by

    \[ D_i = \frac{\exp(-y_i f_w(x_i))}{Z}, \]

    where Z is a normalization factor that ensures that D is a probability vector.
    Show that

    \[ \frac{\partial R(w)}{\partial w_j} = -\sum_{i=1}^m D_i\, y_i h_j(x_i). \]

    Furthermore, denoting $\epsilon_j = \sum_{i=1}^m D_i \mathbb{1}_{[h_j(x_i) \ne y_i]}$, show that

    \[ \frac{\partial R(w)}{\partial w_j} = 2\epsilon_j - 1. \]

    Conclude that if $\epsilon_j \le 1/2 - \gamma$ then $\left|\frac{\partial R(w)}{\partial w_j}\right| \ge \gamma/2$.

  • Show that the update of AdaBoost guarantees
    $R(w^{(t+1)}) - R(w^{(t)}) \le \log\left(\sqrt{1 - 4\gamma^2}\right)$. Hint: Use the proof of Theorem 10.2.


In Chapter 4 we have shown that uniform convergence is a sufficient condition for
learnability. In this chapter we study the Rademacher complexity, which measures
the rate of uniform convergence. We will provide generalization bounds based on
this measure.


26.1 THE RADEMACHER COMPLEXITY 


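Recall the definition of an $\epsilon$-representative sample from Chapter 4, repeated here
for convenience.


Definition 26.1 ($\epsilon$-Representative Sample). A training set S is called
$\epsilon$-representative (w.r.t. domain Z, hypothesis class $\mathcal{H}$, loss function $\ell$, and
distribution $\mathcal{D}$) if

\[ \sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)| \le \epsilon . \]


We have shown that if S is an $\epsilon/2$-representative sample then the ERM rule is
$\epsilon$-consistent, namely, $L_{\mathcal{D}}(\mathrm{ERM}_{\mathcal{H}}(S)) \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon$.
To simplify our notation, let us denote

\[ \mathcal{F} \stackrel{\text{def}}{=} \ell \circ \mathcal{H} \stackrel{\text{def}}{=} \{ z \mapsto \ell(h,z) : h \in \mathcal{H} \}, \]

and given $f \in \mathcal{F}$, we define

\[ L_{\mathcal{D}}(f) = \mathbb{E}_{z \sim \mathcal{D}}[f(z)], \qquad L_S(f) = \frac{1}{m}\sum_{i=1}^m f(z_i). \]

We define the representativeness of S with respect to $\mathcal{F}$ as the largest gap between
the true error of a function f and its empirical error, namely,

\[ \mathrm{Rep}_{\mathcal{D}}(\mathcal{F},S) \stackrel{\text{def}}{=} \sup_{f \in \mathcal{F}} \left( L_{\mathcal{D}}(f) - L_S(f) \right). \tag{26.1} \]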


Now, suppose we would like to estimate the representativeness of S using the
sample S only. One simple idea is to split S into two disjoint sets, $S = S_1 \cup S_2$;
refer to $S_1$ as a validation set and to $S_2$ as a training set. We can then estimate
the representativeness of S by

\[ \sup_{f \in \mathcal{F}} \left( L_{S_1}(f) - L_{S_2}(f) \right). \tag{26.2} \]

This can be written more compactly by defining $\sigma = (\sigma_1,\dots,\sigma_m) \in \{\pm 1\}^m$ to be a
vector such that $S_1 = \{z_i : \sigma_i = 1\}$ and $S_2 = \{z_i : \sigma_i = -1\}$. Then, if we further assume
that $|S_1| = |S_2|$ then Equation (26.2) can be rewritten as

\[ \frac{2}{m} \sup_{f \in \mathcal{F}} \sum_{i=1}^m \sigma_i f(z_i). \tag{26.3} \]


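The Rademacher complexity measure captures this idea by considering the
expectation of the term appearing in Equation (26.3) with respect to a random choice of $\sigma$.
Formally, let $\mathcal{F} \circ S$ be the set of all possible evaluations a function $f \in \mathcal{F}$ can achieve
on a sample S, namely,

\[ \mathcal{F} \circ S = \{ (f(z_1),\dots,f(z_m)) : f \in \mathcal{F} \}. \]

Let the variables in $\sigma$ be distributed i.i.d. according to $\mathbb{P}[\sigma_i = 1] = \mathbb{P}[\sigma_i = -1] = \tfrac{1}{2}$.
Then, the Rademacher complexity of $\mathcal{F}$ with respect to S is defined as follows:

\[ R(\mathcal{F} \circ S) \stackrel{\text{def}}{=} \frac{1}{m} \mathbb{E}_{\sigma \sim \{\pm 1\}^m} \left[ \sup_{f \in \mathcal{F}} \sum_{i=1}^m \sigma_i f(z_i) \right]. \tag{26.4} \]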


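More generally, given a set of vectors, $A \subset \mathbb{R}^m$, we define

\[ R(A) \stackrel{\text{def}}{=} \frac{1}{m} \mathbb{E}_{\sigma} \left[ \sup_{a \in A} \sum_{i=1}^m \sigma_i a_i \right]. \tag{26.5} \]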
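
For a finite set A and a small m, the Rademacher complexity R(A) can be computed exactly
by enumerating all sign vectors. The following is a minimal sketch under those
assumptions (the function name `rademacher` is ours); it is exponential in m and meant
only to make the definition concrete.

```python
from itertools import product


def rademacher(A):
    """Exact R(A) = (1/m) * E_sigma[ max_{a in A} sum_i sigma_i * a_i ] for finite A.

    Enumerates all 2**m sign vectors, so it is practical only for small m.
    """
    m = len(A[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        total += max(sum(s * a_i for s, a_i in zip(sigma, a)) for a in A)
    return total / (2 ** m * m)


# A single vector cannot correlate with random signs on average.
print(rademacher([(1.0, 2.0, 3.0)]))
# The set of all sign vectors matches every sigma perfectly.
print(rademacher(list(product((-1, 1), repeat=3))))
```

The two examples show the extremes: a singleton has complexity 0, whereas a set rich
enough to fit every sign pattern has complexity 1.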


The following lemma bounds the expected value of the representativeness of S 
by twice the expected Rademacher complexity. 


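Lemma 26.2.

\[ \mathbb{E}_{S \sim \mathcal{D}^m} \left[ \mathrm{Rep}_{\mathcal{D}}(\mathcal{F},S) \right] \le 2\, \mathbb{E}_{S \sim \mathcal{D}^m} R(\mathcal{F} \circ S). \]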


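Proof. Let $S' = \{z'_1,\dots,z'_m\}$ be another i.i.d. sample. Clearly, for all $f \in \mathcal{F}$,
$L_{\mathcal{D}}(f) = \mathbb{E}_{S'}[L_{S'}(f)]$. Therefore, for every $f \in \mathcal{F}$ we have

\[ L_{\mathcal{D}}(f) - L_S(f) = \mathbb{E}_{S'}[L_{S'}(f)] - L_S(f) = \mathbb{E}_{S'}[L_{S'}(f) - L_S(f)]. \]

Taking supremum over $f \in \mathcal{F}$ of both sides, and using the fact that the supremum
of expectation is smaller than expectation of the supremum we obtain

\[ \sup_{f \in \mathcal{F}} \left( L_{\mathcal{D}}(f) - L_S(f) \right)
 = \sup_{f \in \mathcal{F}} \mathbb{E}_{S'}[L_{S'}(f) - L_S(f)]
 \le \mathbb{E}_{S'} \left[ \sup_{f \in \mathcal{F}} \left( L_{S'}(f) - L_S(f) \right) \right]. \]

Taking expectation over S on both sides we obtain

\[ \mathbb{E}_{S} \left[ \sup_{f \in \mathcal{F}} \left( L_{\mathcal{D}}(f) - L_S(f) \right) \right]
 \le \mathbb{E}_{S,S'} \left[ \sup_{f \in \mathcal{F}} \left( L_{S'}(f) - L_S(f) \right) \right]
 = \frac{1}{m} \mathbb{E}_{S,S'} \left[ \sup_{f \in \mathcal{F}} \sum_{i=1}^m \left( f(z'_i) - f(z_i) \right) \right]. \tag{26.6} \]

Next, we note that for each j, $z_j$ and $z'_j$ are i.i.d. variables. Therefore, we can swap
them without affecting the expectation:

\[ \mathbb{E}_{S,S'} \left[ \sup_{f \in \mathcal{F}} \left( (f(z'_j) - f(z_j)) + \sum_{i \ne j} (f(z'_i) - f(z_i)) \right) \right]
 = \mathbb{E}_{S,S'} \left[ \sup_{f \in \mathcal{F}} \left( (f(z_j) - f(z'_j)) + \sum_{i \ne j} (f(z'_i) - f(z_i)) \right) \right]. \tag{26.7} \]

Let $\sigma_j$ be a random variable such that $\mathbb{P}[\sigma_j = 1] = \mathbb{P}[\sigma_j = -1] = 1/2$. From
Equation (26.7) we obtain that

\[ \mathbb{E}_{S,S',\sigma_j} \left[ \sup_{f \in \mathcal{F}} \left( \sigma_j (f(z'_j) - f(z_j)) + \sum_{i \ne j} (f(z'_i) - f(z_i)) \right) \right] \]
\[ = \tfrac{1}{2} (\text{l.h.s. of Equation (26.7)}) + \tfrac{1}{2} (\text{r.h.s. of Equation (26.7)}) \]
\[ = \mathbb{E}_{S,S'} \left[ \sup_{f \in \mathcal{F}} \left( (f(z'_j) - f(z_j)) + \sum_{i \ne j} (f(z'_i) - f(z_i)) \right) \right]. \tag{26.8} \]

Repeating this for all j we obtain that

\[ \mathbb{E}_{S,S'} \left[ \sup_{f \in \mathcal{F}} \sum_{i=1}^m (f(z'_i) - f(z_i)) \right]
 = \mathbb{E}_{S,S',\sigma} \left[ \sup_{f \in \mathcal{F}} \sum_{i=1}^m \sigma_i (f(z'_i) - f(z_i)) \right]. \tag{26.9} \]

Finally,

\[ \sup_{f \in \mathcal{F}} \sum_{i} \sigma_i (f(z'_i) - f(z_i)) \le \sup_{f \in \mathcal{F}} \sum_{i} \sigma_i f(z'_i) + \sup_{f \in \mathcal{F}} \sum_{i} -\sigma_i f(z_i), \]

and since the probability of $\sigma$ is the same as the probability of $-\sigma$, the right-hand
side of Equation (26.9) can be bounded by

\[ \mathbb{E}_{S,S',\sigma} \left[ \sup_{f \in \mathcal{F}} \sum_{i} \sigma_i f(z'_i) + \sup_{f \in \mathcal{F}} \sum_{i} \sigma_i f(z_i) \right]
 = m\, \mathbb{E}_{S'}[R(\mathcal{F} \circ S')] + m\, \mathbb{E}_{S}[R(\mathcal{F} \circ S)] = 2m\, \mathbb{E}_{S}[R(\mathcal{F} \circ S)]. \]

□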


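The lemma immediately yields that, in expectation, the ERM rule finds a
hypothesis which is close to the optimal hypothesis in $\mathcal{H}$.


Theorem 26.3. We have

\[ \mathbb{E}_{S \sim \mathcal{D}^m} \left[ L_{\mathcal{D}}(\mathrm{ERM}_{\mathcal{H}}(S)) - L_S(\mathrm{ERM}_{\mathcal{H}}(S)) \right] \le 2\, \mathbb{E}_{S \sim \mathcal{D}^m} R(\ell \circ \mathcal{H} \circ S). \]

Furthermore, for any $h^\star \in \mathcal{H}$,

\[ \mathbb{E}_{S \sim \mathcal{D}^m} \left[ L_{\mathcal{D}}(\mathrm{ERM}_{\mathcal{H}}(S)) - L_{\mathcal{D}}(h^\star) \right] \le 2\, \mathbb{E}_{S \sim \mathcal{D}^m} R(\ell \circ \mathcal{H} \circ S). \]

Furthermore, if $h^\star = \operatorname{argmin}_h L_{\mathcal{D}}(h)$ then for each $\delta \in (0,1)$ with probability of at least
$1 - \delta$ over the choice of S we have

\[ L_{\mathcal{D}}(\mathrm{ERM}_{\mathcal{H}}(S)) - L_{\mathcal{D}}(h^\star) \le \frac{2\, \mathbb{E}_{S' \sim \mathcal{D}^m} R(\ell \circ \mathcal{H} \circ S')}{\delta}. \]

Proof. The first inequality follows directly from Lemma 26.2. The second inequality
follows because for any fixed $h^\star$,

\[ L_{\mathcal{D}}(h^\star) = \mathbb{E}_S[L_S(h^\star)] \ge \mathbb{E}_S[L_S(\mathrm{ERM}_{\mathcal{H}}(S))]. \]

The third inequality follows from the previous inequality by relying on Markov’s
inequality (note that the random variable $L_{\mathcal{D}}(\mathrm{ERM}_{\mathcal{H}}(S)) - L_{\mathcal{D}}(h^\star)$ is nonnegative).
□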


Next, we derive bounds similar to the bounds in Theorem 26.3 with a better
dependence on the confidence parameter $\delta$. To do so, we first introduce the
following bounded differences concentration inequality.


Lemma 26.4 (McDiarmid’s Inequality). Let V be some set and let $f : V^m \to \mathbb{R}$
be a function of m variables such that for some c > 0, for all $i \in [m]$ and for all
$x_1,\dots,x_m, x'_i \in V$ we have

\[ |f(x_1,\dots,x_m) - f(x_1,\dots,x_{i-1},x'_i,x_{i+1},\dots,x_m)| \le c. \]

Let $X_1,\dots,X_m$ be m independent random variables taking values in V. Then, with
probability of at least $1 - \delta$ we have

\[ |f(X_1,\dots,X_m) - \mathbb{E}[f(X_1,\dots,X_m)]| \le c \sqrt{\ln\left(\tfrac{2}{\delta}\right) m/2}. \]


On the basis of the McDiarmid inequality we can derive generalization bounds 
with a better dependence on the confidence parameter. 


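Theorem 26.5. Assume that for all z and $h \in \mathcal{H}$ we have that $|\ell(h,z)| \le c$. Then,

1. With probability of at least $1 - \delta$, for all $h \in \mathcal{H}$,

\[ L_{\mathcal{D}}(h) - L_S(h) \le 2\, \mathbb{E}_{S' \sim \mathcal{D}^m} R(\ell \circ \mathcal{H} \circ S') + c \sqrt{\frac{2 \ln(2/\delta)}{m}}. \]

In particular, this holds for $h = \mathrm{ERM}_{\mathcal{H}}(S)$.

2. With probability of at least $1 - \delta$, for all $h \in \mathcal{H}$,

\[ L_{\mathcal{D}}(h) - L_S(h) \le 2\, R(\ell \circ \mathcal{H} \circ S) + 4c \sqrt{\frac{2 \ln(4/\delta)}{m}}. \]

In particular, this holds for $h = \mathrm{ERM}_{\mathcal{H}}(S)$.

3. For any $h^\star$, with probability of at least $1 - \delta$,

\[ L_{\mathcal{D}}(\mathrm{ERM}_{\mathcal{H}}(S)) - L_{\mathcal{D}}(h^\star) \le 2\, R(\ell \circ \mathcal{H} \circ S) + 5c \sqrt{\frac{2 \ln(8/\delta)}{m}}. \]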


Proof. First note that the random variable $\mathrm{Rep}_{\mathcal{D}}(\mathcal{F},S) = \sup_{h \in \mathcal{H}} (L_{\mathcal{D}}(h) - L_S(h))$
satisfies the bounded differences condition of Lemma 26.4 with a constant 2c/m.
Combining the bounds in Lemma 26.4 with Lemma 26.2 we obtain that with
probability of at least $1 - \delta$,

\[ \mathrm{Rep}_{\mathcal{D}}(\mathcal{F},S) \le \mathbb{E}_S\, \mathrm{Rep}_{\mathcal{D}}(\mathcal{F},S) + c \sqrt{\frac{2\ln(2/\delta)}{m}}
 \le 2\, \mathbb{E}_{S'} R(\ell \circ \mathcal{H} \circ S') + c \sqrt{\frac{2\ln(2/\delta)}{m}}. \]

The first inequality of the theorem follows from the definition of $\mathrm{Rep}_{\mathcal{D}}(\mathcal{F},S)$. For
the second inequality we note that the random variable $R(\ell \circ \mathcal{H} \circ S)$ also satisfies
the bounded differences condition of Lemma 26.4 with a constant 2c/m. Therefore,
the second inequality follows from the first inequality, Lemma 26.4, and the union
bound. Finally, for the last inequality, denote $h_S = \mathrm{ERM}_{\mathcal{H}}(S)$ and note that

\[ L_{\mathcal{D}}(h_S) - L_{\mathcal{D}}(h^\star)
 = L_{\mathcal{D}}(h_S) - L_S(h_S) + L_S(h_S) - L_S(h^\star) + L_S(h^\star) - L_{\mathcal{D}}(h^\star) \]
\[ \le \left( L_{\mathcal{D}}(h_S) - L_S(h_S) \right) + \left( L_S(h^\star) - L_{\mathcal{D}}(h^\star) \right). \tag{26.10} \]

The first summand on the right-hand side is bounded by the second inequality of
the theorem. For the second summand, we use the fact that $h^\star$ does not depend on
S; hence by using Hoeffding’s inequality we obtain that with probability of at least
$1 - \delta/2$,

\[ L_S(h^\star) - L_{\mathcal{D}}(h^\star) \le c \sqrt{\frac{\ln(4/\delta)}{2m}}. \tag{26.11} \]

Combining this with the union bound we conclude our proof. □


The preceding theorem tells us that if the quantity $R(\ell \circ \mathcal{H} \circ S)$ is small then it
is possible to learn the class $\mathcal{H}$ using the ERM rule. It is important to emphasize
that the last two bounds given in the theorem depend on the specific training set S.
That is, we use S both for learning a hypothesis from $\mathcal{H}$ as well as for estimating the
quality of it. This type of bound is called a data-dependent bound.


26.1.1 Rademacher Calculus 




Let us now discuss some properties of the Rademacher complexity measure. These
properties will help us in deriving some simple bounds on $R(\ell \circ \mathcal{H} \circ S)$ for specific
cases of interest.

The following lemma is immediate from the definition. 


Lemma 26.6. For any $A \subset \mathbb{R}^m$, scalar $c \in \mathbb{R}$, and vector $a_0 \in \mathbb{R}^m$, we have

\[ R(\{ c\,a + a_0 : a \in A \}) \le |c|\, R(A). \]


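The following lemma tells us that the convex hull of A has the same complexity
as A.


Lemma 26.7. Let A be a subset of $\mathbb{R}^m$ and let
$A' = \{ \sum_{j=1}^N \alpha_j a^{(j)} : N \in \mathbb{N},\ \forall j,\ a^{(j)} \in A,\ \alpha_j \ge 0,\ \|\alpha\|_1 = 1 \}$.
Then, $R(A') = R(A)$.

Proof. The main idea follows from the fact that for any vector v we have

\[ \sup_{\alpha \ge 0 : \|\alpha\|_1 = 1} \sum_{j=1}^N \alpha_j v_j = \max_j v_j. \]

Therefore,

\[ m\, R(A') = \mathbb{E}_{\sigma} \sup_{\alpha \ge 0 : \|\alpha\|_1 = 1}\ \sup_{a^{(1)},\dots,a^{(N)}} \sum_{i=1}^m \sigma_i \sum_{j=1}^N \alpha_j a_i^{(j)} \]
\[ = \mathbb{E}_{\sigma} \sup_{\alpha \ge 0 : \|\alpha\|_1 = 1} \sum_{j=1}^N \alpha_j \sup_{a^{(j)}} \sum_{i=1}^m \sigma_i a_i^{(j)} \]
\[ = \mathbb{E}_{\sigma} \sup_{a \in A} \sum_{i=1}^m \sigma_i a_i \]
\[ = m\, R(A), \]

and we conclude our proof. □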


The next lemma, due to Massart, states that the Rademacher complexity of a 
finite set grows logarithmically with the size of the set. 


Lemma 26.8 (Massart Lemma). Let $A = \{a_1,\dots,a_N\}$ be a finite set of vectors in $\mathbb{R}^m$.
Define $\bar{a} = \frac{1}{N}\sum_{i=1}^N a_i$. Then,

\[ R(A) \le \max_{a \in A} \|a - \bar{a}\| \, \frac{\sqrt{2 \log(N)}}{m}. \]

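Proof. On the basis of Lemma 26.6, we can assume without loss of generality that
$\bar{a} = 0$. Let $\lambda > 0$ and let $A' = \{\lambda a_1,\dots,\lambda a_N\}$. We upper bound the Rademacher
complexity as follows:

\[ m\, R(A') = \mathbb{E}_{\sigma} \left[ \max_{a \in A'} \langle \sigma, a \rangle \right]
 = \mathbb{E}_{\sigma} \left[ \log \left( \max_{a \in A'} e^{\langle \sigma, a \rangle} \right) \right] \]
\[ \le \mathbb{E}_{\sigma} \left[ \log \left( \sum_{a \in A'} e^{\langle \sigma, a \rangle} \right) \right] \]
\[ \le \log \left( \mathbb{E}_{\sigma} \left[ \sum_{a \in A'} e^{\langle \sigma, a \rangle} \right] \right) \qquad \text{// Jensen’s inequality} \]
\[ = \log \left( \sum_{a \in A'} \prod_{i=1}^m \mathbb{E}_{\sigma_i} \left[ e^{\sigma_i a_i} \right] \right), \]

where the last equality occurs because the Rademacher variables are independent.
Next, using Lemma A.6 we have that for all $a_i \in \mathbb{R}$,

\[ \mathbb{E}_{\sigma_i} e^{\sigma_i a_i} = \frac{\exp(a_i) + \exp(-a_i)}{2} \le \exp(a_i^2/2), \]

and therefore

\[ m\, R(A') \le \log \left( \sum_{a \in A'} \prod_{i=1}^m \exp\left( \frac{a_i^2}{2} \right) \right)
 = \log \left( \sum_{a \in A'} \exp\left( \|a\|^2/2 \right) \right) \]
\[ \le \log \left( |A'| \max_{a \in A'} \exp\left( \|a\|^2/2 \right) \right) = \log(|A'|) + \max_{a \in A'} \left( \|a\|^2/2 \right). \]

Since $R(A) = \frac{1}{\lambda} R(A')$ we obtain from the equation that

\[ R(A) \le \frac{\log(|A|) + \lambda^2 \max_{a \in A} \left( \|a\|^2/2 \right)}{\lambda m}. \]

Setting $\lambda = \sqrt{2 \log(|A|) / \max_{a \in A} \|a\|^2}$ and rearranging terms we conclude our
proof. □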
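
Massart’s bound can be compared against the exact Rademacher complexity on a small
finite set. The sketch below is illustrative only: the helper names `rademacher` and
`massart_bound` and the example set are ours, and the exact value is obtained by brute
force over all sign vectors.

```python
import math
from itertools import product


def rademacher(A):
    """Exact R(A) by enumerating all sign vectors (small m only)."""
    m = len(A[0])
    total = sum(max(sum(s * a_i for s, a_i in zip(sigma, a)) for a in A)
                for sigma in product((-1, 1), repeat=m))
    return total / (2 ** m * m)


def massart_bound(A):
    """max_a ||a - a_bar|| * sqrt(2 * log N) / m, with a_bar the mean of A."""
    m, n_vectors = len(A[0]), len(A)
    a_bar = [sum(a[i] for a in A) / n_vectors for i in range(m)]
    radius = max(math.sqrt(sum((a[i] - a_bar[i]) ** 2 for i in range(m))) for a in A)
    return radius * math.sqrt(2 * math.log(n_vectors)) / m


# The four standard basis vectors of R^4.
A = [(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0),
     (0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 0.0, 1.0)]
print(rademacher(A), massart_bound(A))
```

For this set the exact value is 0.21875, since the maximum of four independent signs is
+1 unless all four are −1, and the bound is somewhat larger, as the lemma guarantees.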


The following lemma shows that composing A with a Lipschitz function does not 
blow up the Rademacher complexity. The proof is due to Kakade and Tewari. 


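Lemma 26.9 (Contraction Lemma). For each $i \in [m]$, let $\phi_i : \mathbb{R} \to \mathbb{R}$ be a $\rho$-Lipschitz
function; namely, for all $\alpha, \beta \in \mathbb{R}$ we have $|\phi_i(\alpha) - \phi_i(\beta)| \le \rho |\alpha - \beta|$. For $a \in \mathbb{R}^m$ let
$\phi(a)$ denote the vector $(\phi_1(a_1),\dots,\phi_m(a_m))$. Let $\phi \circ A = \{\phi(a) : a \in A\}$. Then,

\[ R(\phi \circ A) \le \rho\, R(A). \]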


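Proof. For simplicity, we prove the lemma for the case $\rho = 1$. The case $\rho \ne 1$
will follow by defining $\phi' = \frac{1}{\rho}\phi$ and then using Lemma 26.6. Let
$A_i = \{(a_1,\dots,a_{i-1},\phi_i(a_i),a_{i+1},\dots,a_m) : a \in A\}$. Clearly, it suffices to prove that for any
set A and all i we have $R(A_i) \le R(A)$. Without loss of generality we will prove the
latter claim for i = 1 and to simplify notation we omit the subscript from $\phi_1$. We have

\[ m\, R(A_1) = \mathbb{E}_{\sigma} \left[ \sup_{a \in A_1} \sum_{i=1}^m \sigma_i a_i \right] \]
\[ = \mathbb{E}_{\sigma} \left[ \sup_{a \in A} \left( \sigma_1 \phi(a_1) + \sum_{i=2}^m \sigma_i a_i \right) \right] \]
\[ = \frac{1}{2} \mathbb{E}_{\sigma_2,\dots,\sigma_m} \left[ \sup_{a \in A} \left( \phi(a_1) + \sum_{i=2}^m \sigma_i a_i \right) + \sup_{a \in A} \left( -\phi(a_1) + \sum_{i=2}^m \sigma_i a_i \right) \right] \]
\[ = \frac{1}{2} \mathbb{E}_{\sigma_2,\dots,\sigma_m} \left[ \sup_{a,a' \in A} \left( \phi(a_1) - \phi(a'_1) + \sum_{i=2}^m \sigma_i a_i + \sum_{i=2}^m \sigma_i a'_i \right) \right] \]
\[ \le \frac{1}{2} \mathbb{E}_{\sigma_2,\dots,\sigma_m} \left[ \sup_{a,a' \in A} \left( |a_1 - a'_1| + \sum_{i=2}^m \sigma_i a_i + \sum_{i=2}^m \sigma_i a'_i \right) \right], \tag{26.12} \]

where in the last inequality we used the assumption that $\phi$ is Lipschitz. Next, we
note that the absolute value on $|a_1 - a'_1|$ in the preceding expression can be omitted
since both a and a' are from the same set A and the rest of the expression in the
supremum is not affected by replacing a and a'. Therefore,

\[ m\, R(A_1) \le \frac{1}{2} \mathbb{E}_{\sigma_2,\dots,\sigma_m} \left[ \sup_{a,a' \in A} \left( a_1 - a'_1 + \sum_{i=2}^m \sigma_i a_i + \sum_{i=2}^m \sigma_i a'_i \right) \right]. \tag{26.13} \]

But, using the same equalities as in Equation (26.12), it is easy to see that the
right-hand side of Equation (26.13) exactly equals m R(A), which concludes our proof. □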
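
The contraction lemma can be sanity-checked numerically on a small finite set. In the
sketch below the helper names, the example set, and the choice of Lipschitz functions
(the absolute value, which is 1-Lipschitz, and t ↦ 2t, which is 2-Lipschitz) are
assumptions made for illustration; the same function is applied to every coordinate.

```python
from itertools import product


def rademacher(A):
    """Exact R(A) by enumerating all sign vectors (small m only)."""
    m = len(A[0])
    total = sum(max(sum(s * a_i for s, a_i in zip(sigma, a)) for a in A)
                for sigma in product((-1, 1), repeat=m))
    return total / (2 ** m * m)


def compose(phi, A):
    """phi o A: apply the scalar function phi to every coordinate of every vector."""
    return [tuple(phi(a_i) for a_i in a) for a in A]


A = [(-2.0, 1.0, 0.5), (3.0, -1.0, 2.0), (0.0, 0.0, -4.0)]
print(rademacher(A))
print(rademacher(compose(abs, A)))               # at most R(A)
print(rademacher(compose(lambda t: 2 * t, A)))   # equals 2 * R(A)
```

The linear map t ↦ 2t attains the bound with equality, which shows that the factor ρ in
the lemma cannot be improved in general.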


26.2 RADEMACHER COMPLEXITY OF LINEAR CLASSES 


In this section we analyze the Rademacher complexity of linear classes. To simplify 
the derivation we first define the following two classes: 


\[ \mathcal{H}_1 = \{ x \mapsto \langle w, x \rangle : \|w\|_1 \le 1 \}, \qquad \mathcal{H}_2 = \{ x \mapsto \langle w, x \rangle : \|w\|_2 \le 1 \}. \tag{26.14} \]


The following lemma bounds the Rademacher complexity of $\mathcal{H}_2$. We allow the
x; to be vectors in any Hilbert space (even infinite dimensional), and the bound 
does not depend on the dimensionality of the Hilbert space. This property becomes 
useful when analyzing kernel methods. 


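Lemma 26.10. Let $S = (x_1,\dots,x_m)$ be vectors in a Hilbert space. Define
$\mathcal{H}_2 \circ S = \{ (\langle w, x_1 \rangle,\dots,\langle w, x_m \rangle) : \|w\|_2 \le 1 \}$. Then,

\[ R(\mathcal{H}_2 \circ S) \le \frac{\max_i \|x_i\|_2}{\sqrt{m}}. \]

Proof. Using the Cauchy-Schwarz inequality we know that for any vectors w, v we
have $\langle w, v \rangle \le \|w\| \, \|v\|$. Therefore,

\[ m\, R(\mathcal{H}_2 \circ S) = \mathbb{E}_{\sigma} \left[ \sup_{a \in \mathcal{H}_2 \circ S} \sum_{i=1}^m \sigma_i a_i \right] \]
\[ = \mathbb{E}_{\sigma} \left[ \sup_{w : \|w\| \le 1} \sum_{i=1}^m \sigma_i \langle w, x_i \rangle \right] \]
\[ = \mathbb{E}_{\sigma} \left[ \sup_{w : \|w\| \le 1} \left\langle w, \sum_{i=1}^m \sigma_i x_i \right\rangle \right] \]
\[ \le \mathbb{E}_{\sigma} \left[ \left\| \sum_{i=1}^m \sigma_i x_i \right\|_2 \right]. \tag{26.15} \]

Next, using Jensen’s inequality we have that

\[ \mathbb{E}_{\sigma} \left[ \left\| \sum_{i=1}^m \sigma_i x_i \right\|_2 \right]
 = \mathbb{E}_{\sigma} \left[ \left( \left\| \sum_{i=1}^m \sigma_i x_i \right\|_2^2 \right)^{1/2} \right]
 \le \left( \mathbb{E}_{\sigma} \left[ \left\| \sum_{i=1}^m \sigma_i x_i \right\|_2^2 \right] \right)^{1/2}. \tag{26.16} \]

Finally, since the variables $\sigma_1,\dots,\sigma_m$ are independent we have

\[ \mathbb{E}_{\sigma} \left[ \left\| \sum_{i=1}^m \sigma_i x_i \right\|_2^2 \right]
 = \mathbb{E}_{\sigma} \left[ \sum_{i,j} \sigma_i \sigma_j \langle x_i, x_j \rangle \right] \]
\[ = \sum_{i \ne j} \langle x_i, x_j \rangle\, \mathbb{E}_{\sigma}[\sigma_i \sigma_j] + \sum_{i=1}^m \langle x_i, x_i \rangle\, \mathbb{E}_{\sigma}[\sigma_i^2] \]
\[ = \sum_{i=1}^m \|x_i\|_2^2 \le m \max_i \|x_i\|_2^2. \]

Combining this with Equation (26.15) and Equation (26.16) we conclude our proof.
□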
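
For vectors in a finite-dimensional space, the supremum over the unit $\ell_2$ ball is
attained by aligning w with $\sum_i \sigma_i x_i$, so the Rademacher complexity of the
class equals the average Euclidean norm of that sum divided by m. The following sketch
uses this observation to compare the exact value with the bound of Lemma 26.10; the
function names and sample points are illustrative assumptions.

```python
import math
from itertools import product


def rademacher_l2_ball(xs):
    """Exact R(H_2 o S) = (1/m) * E_sigma || sum_i sigma_i x_i ||_2 (small m only)."""
    m, d = len(xs), len(xs[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        s = [sum(sig * x[j] for sig, x in zip(sigma, xs)) for j in range(d)]
        total += math.sqrt(sum(c * c for c in s))
    return total / (2 ** m * m)


def l2_bound(xs):
    """max_i ||x_i||_2 / sqrt(m)."""
    return max(math.sqrt(sum(c * c for c in x)) for x in xs) / math.sqrt(len(xs))


orthonormal = [(1.0, 0.0), (0.0, 1.0)]
print(rademacher_l2_ball(orthonormal), l2_bound(orthonormal))  # the bound is tight here
sample = [(3.0, 4.0), (1.0, 0.0), (0.0, 2.0)]
print(rademacher_l2_ball(sample), l2_bound(sample))
```

For orthogonal vectors of equal norm the norm of the signed sum does not depend on σ,
so Jensen’s inequality holds with equality and the bound is attained.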


Next we bound the Rademacher complexity of $\mathcal{H}_1 \circ S$.


Lemma 26.11. Let $S = (x_1,\dots,x_m)$ be vectors in $\mathbb{R}^n$. Then,

\[ R(\mathcal{H}_1 \circ S) \le \max_i \|x_i\|_\infty \sqrt{\frac{2 \log(2n)}{m}}. \]

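Proof. Using Hölder’s inequality we know that for any vectors w, v we have
$\langle w, v \rangle \le \|w\|_1 \|v\|_\infty$. Therefore,

\[ m\, R(\mathcal{H}_1 \circ S) = \mathbb{E}_{\sigma} \left[ \sup_{a \in \mathcal{H}_1 \circ S} \sum_{i=1}^m \sigma_i a_i \right] \]
\[ = \mathbb{E}_{\sigma} \left[ \sup_{w : \|w\|_1 \le 1} \sum_{i=1}^m \sigma_i \langle w, x_i \rangle \right] \]
\[ = \mathbb{E}_{\sigma} \left[ \sup_{w : \|w\|_1 \le 1} \left\langle w, \sum_{i=1}^m \sigma_i x_i \right\rangle \right] \]
\[ \le \mathbb{E}_{\sigma} \left[ \left\| \sum_{i=1}^m \sigma_i x_i \right\|_\infty \right]. \tag{26.17} \]

For each $j \in [n]$, let $v_j = (x_{1,j},\dots,x_{m,j}) \in \mathbb{R}^m$. Note that $\|v_j\|_2 \le \sqrt{m} \max_i \|x_i\|_\infty$. Let
$V = \{v_1,\dots,v_n,-v_1,\dots,-v_n\}$. The right-hand side of Equation (26.17) is m R(V).
Using Massart lemma (Lemma 26.8) we have that

\[ R(V) \le \max_i \|x_i\|_\infty \sqrt{2 \log(2n)/m}, \]

which concludes our proof. □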
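
The same brute-force check works for the $\ell_1$ ball: the supremum of a linear
function over it is attained at a signed coordinate vector, so the complexity equals
the average $\ell_\infty$ norm of $\sum_i \sigma_i x_i$ divided by m. The sketch below
compares that exact value with the bound of Lemma 26.11; the names and sample points
are illustrative assumptions.

```python
import math
from itertools import product


def rademacher_l1_ball(xs):
    """Exact R(H_1 o S) = (1/m) * E_sigma || sum_i sigma_i x_i ||_inf (small m only)."""
    m, n = len(xs), len(xs[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        total += max(abs(sum(sig * x[j] for sig, x in zip(sigma, xs))) for j in range(n))
    return total / (2 ** m * m)


def l1_bound(xs):
    """max_i ||x_i||_inf * sqrt(2 * log(2n) / m)."""
    m, n = len(xs), len(xs[0])
    return max(max(abs(c) for c in x) for x in xs) * math.sqrt(2 * math.log(2 * n) / m)


sample = [(1.0, -2.0, 0.5), (0.5, 1.0, -1.0), (2.0, 0.0, 1.0), (-1.0, 1.5, 2.0)]
print(rademacher_l1_ball(sample), l1_bound(sample))
```

Note that the bound depends on the dimension n only through log(2n), in contrast to the
dimension-free bound for the $\ell_2$ ball.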


26.3 GENERALIZATION BOUNDS FOR SVM


In this section we use Rademacher complexity to derive generalization bounds for
generalized linear predictors with Euclidean norm constraint. We will show how this
leads to generalization bounds for hard-SVM and soft-SVM.

We shall consider the following general constraint-based formulation. Let
$\mathcal{H} = \{w : \|w\|_2 \le B\}$ be our hypothesis class, and let $Z = \mathcal{X} \times \mathcal{Y}$ be the examples domain.
Assume that the loss function $\ell : \mathcal{H} \times Z \to \mathbb{R}$ is of the form

\[ \ell(w,(x,y)) = \phi(\langle w, x \rangle, y), \tag{26.18} \]

where $\phi : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$ is such that for all $y \in \mathcal{Y}$, the scalar function $a \mapsto \phi(a,y)$ is
$\rho$-Lipschitz. For example, the hinge-loss function, $\ell(w,(x,y)) = \max\{0, 1 - y\langle w, x \rangle\}$,
can be written as in Equation (26.18) using $\phi(a,y) = \max\{0, 1 - ya\}$, and note
that $\phi$ is 1-Lipschitz for all $y \in \{\pm 1\}$. Another example is the absolute loss
function, $\ell(w,(x,y)) = |\langle w, x \rangle - y|$, which can be written as in Equation (26.18) using
$\phi(a,y) = |a - y|$, which is also 1-Lipschitz for all $y \in \mathbb{R}$.

The following theorem bounds the generalization error of all predictors in H 
using their empirical error. 


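Theorem 26.12. Suppose that $\mathcal{D}$ is a distribution over $\mathcal{X} \times \mathcal{Y}$ such that with probability
1 we have that $\|x\|_2 \le R$. Let $\mathcal{H} = \{w : \|w\|_2 \le B\}$ and let $\ell : \mathcal{H} \times Z \to \mathbb{R}$ be a loss
function of the form given in Equation (26.18) such that for all $y \in \mathcal{Y}$, $a \mapsto \phi(a,y)$
is a $\rho$-Lipschitz function and such that $\max_{a \in [-BR,BR]} |\phi(a,y)| \le c$. Then, for any
$\delta \in (0,1)$, with probability of at least $1 - \delta$ over the choice of an i.i.d. sample of size m,

\[ \forall w \in \mathcal{H}, \quad L_{\mathcal{D}}(w) \le L_S(w) + \frac{2 \rho B R}{\sqrt{m}} + c \sqrt{\frac{2 \ln(2/\delta)}{m}}. \]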

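Proof. Let $\mathcal{F} = \{(x,y) \mapsto \phi(\langle w, x \rangle, y) : w \in \mathcal{H}\}$. We will show that with probability 1,
$R(\mathcal{F} \circ S) \le \rho B R/\sqrt{m}$ and then the theorem will follow from Theorem 26.5. Indeed,
the set $\mathcal{F} \circ S$ can be written as

\[ \mathcal{F} \circ S = \{ (\phi(\langle w, x_1 \rangle, y_1),\dots,\phi(\langle w, x_m \rangle, y_m)) : w \in \mathcal{H} \}, \]

and the bound on $R(\mathcal{F} \circ S)$ follows directly by combining Lemma 26.9, Lemma 26.10,
and the assumption that $\|x\|_2 \le R$ with probability 1. □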
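
To get a feel for the sizes involved, the bound of Theorem 26.12 on the gap
$L_{\mathcal{D}}(w) - L_S(w)$ can be evaluated directly. The sketch below is illustrative:
the function name and the example values are assumptions, and for the hinge loss we
take ρ = 1 and c = 1 + BR, which is valid because max{0, 1 − ya} ≤ 1 + BR whenever
|a| ≤ BR and y ∈ {±1}.

```python
import math


def l2_generalization_gap(rho, B, R, c, m, delta):
    """2*rho*B*R/sqrt(m) + c*sqrt(2*ln(2/delta)/m), the gap bound of Theorem 26.12."""
    return 2 * rho * B * R / math.sqrt(m) + c * math.sqrt(2 * math.log(2 / delta) / m)


B, R, delta = 1.0, 1.0, 0.05          # hypothetical problem parameters
for m in (100, 10_000, 1_000_000):
    print(m, l2_generalization_gap(rho=1.0, B=B, R=R, c=1 + B * R, m=m, delta=delta))
```

Both terms scale as 1/√m, so quadrupling the sample size halves the bound, and the
dimension of w does not appear anywhere.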


We next derive a generalization bound for hard-SVM based on the previous 
theorem. For simplicity, we do not allow a bias term and consider the hard-SVM 
problem: 

\[ \operatorname*{argmin}_{w} \|w\|^2 \quad \text{s.t.} \quad \forall i,\ y_i \langle w, x_i \rangle \ge 1 \tag{26.19} \]


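Theorem 26.13. Consider a distribution $\mathcal{D}$ over $\mathcal{X} \times \{\pm 1\}$ such that there exists some
vector $w^\star$ with $\mathbb{P}_{(x,y) \sim \mathcal{D}}[y \langle w^\star, x \rangle \ge 1] = 1$ and such that $\|x\|_2 \le R$ with probability 1.
Let $w_S$ be the output of Equation (26.19). Then, with probability of at least $1 - \delta$ over
the choice of $S \sim \mathcal{D}^m$, we have that

\[ \mathbb{P}_{(x,y) \sim \mathcal{D}} \left[ y \ne \mathrm{sign}(\langle w_S, x \rangle) \right] \le \frac{2 R \|w^\star\|}{\sqrt{m}} + (1 + R \|w^\star\|) \sqrt{\frac{2 \ln(2/\delta)}{m}}. \]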
Proof. Throughout the proof, let the loss function be the ramp loss (see
Section 15.2.3). Note that the range of the ramp loss is [0,1] and that it is a
1-Lipschitz function. Since the ramp loss upper bounds the zero-one loss, we
have that

\[ \mathbb{P}_{(x,y) \sim \mathcal{D}} \left[ y \ne \mathrm{sign}(\langle w_S, x \rangle) \right] \le L_{\mathcal{D}}(w_S). \]

Let $B = \|w^\star\|_2$ and consider the set $\mathcal{H} = \{w : \|w\|_2 \le B\}$. By the definition of
hard-SVM and our assumption on the distribution, we have that $w_S \in \mathcal{H}$ with probability
1 and that $L_S(w_S) = 0$. Therefore, using Theorem 26.12 we have that

\[ L_{\mathcal{D}}(w_S) \le L_S(w_S) + \frac{2 B R}{\sqrt{m}} + \sqrt{\frac{2 \ln(2/\delta)}{m}}. \]

Since $L_S(w_S) = 0$, $B = \|w^\star\|$, and $1 \le 1 + R\|w^\star\|$, the theorem follows. □


Remark 26.1. Theorem 26.13 implies that the sample complexity of hard-SVM
grows like $\frac{R^2 \|w^\star\|^2}{\epsilon^2}$. Using a more delicate analysis and the separability assumption,
it is possible to improve the bound to an order of $\frac{R^2 \|w^\star\|^2}{\epsilon}$.


The bound in the preceding theorem depends on ||w*||, which is unknown. In the 
following we derive a bound that depends on the norm of the output of SVM; hence 
it can be calculated from the training set itself. The proof is similar to the derivation 
of bounds for structural risk minimization (SRM).


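Theorem 26.14. Assume that the conditions of Theorem 26.13 hold. Then, with
probability of at least $1 - \delta$ over the choice of $S \sim \mathcal{D}^m$, we have that

\[ \mathbb{P}_{(x,y) \sim \mathcal{D}} \left[ y \ne \mathrm{sign}(\langle w_S, x \rangle) \right] \le \frac{4 R \|w_S\|}{\sqrt{m}} + \sqrt{\frac{\ln\left( 4 \log_2(\|w_S\|)/\delta \right)}{m}}. \]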


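Proof. For any integer $i$, let $B_i = 2^i$, $\mathcal{H}_i = \{w : \|w\| \le B_i\}$, and let $\delta_i = \frac{\delta}{2i^2}$. Fix $i$, then
using Theorem 26.12 we have that with probability of at least $1-\delta_i$

$$\forall w \in \mathcal{H}_i,\quad L_{\mathcal{D}}(w) \le L_S(w) + \frac{2B_iR}{\sqrt{m}} + \sqrt{\frac{2\ln(2/\delta_i)}{m}}.$$

Applying the union bound and using $\sum_{i=1}^{\infty} \delta_i \le \delta$ we obtain that with probability of
at least $1-\delta$ this holds for all $i$. Therefore, for all $w$, if we let $i = \lceil \log_2(\|w\|) \rceil$ then
$w \in \mathcal{H}_i$, $B_i \le 2\|w\|$, and $\frac{2}{\delta_i} = \frac{(2i)^2}{\delta} \le \frac{(4\log_2(\|w\|))^2}{\delta}$. Therefore,

$$\begin{aligned}
L_{\mathcal{D}}(w) &\le L_S(w) + \frac{2B_iR}{\sqrt{m}} + \sqrt{\frac{2\ln(2/\delta_i)}{m}} \\
&\le L_S(w) + \frac{4\|w\|R}{\sqrt{m}} + \sqrt{\frac{4\big(\ln(4\log_2(\|w\|)) + \ln(1/\delta)\big)}{m}}.
\end{aligned}$$

In particular, it holds for $w_S$, which concludes our proof. □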
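
The union-bound step relies on the weights $\delta_i = \delta/(2i^2)$ summing to at most $\delta$. Since $\sum_{i \ge 1} 1/i^2 = \pi^2/6$, the total is $\delta\pi^2/12 < \delta$. A quick numerical check, as a sketch (the helper name `weight_sum` is hypothetical):

```python
import math


def weight_sum(n_terms):
    """Partial sum of 1/(2*i^2) for i = 1..n_terms, i.e. (sum of delta_i) / delta."""
    return sum(1.0 / (2.0 * i * i) for i in range(1, n_terms + 1))


partial = weight_sum(100_000)
limit = math.pi ** 2 / 12.0  # closed form of the infinite sum
```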


Remark 26.2. Note that all the bounds we have derived do not depend on the dimen- 
sion of w. This property is utilized when learning SVM with kernels, where the 
dimension of w can be extremely large. 


26.4 GENERALIZATION BOUNDS FOR PREDICTORS WITH LOW $\ell_1$ NORM


In the previous section we derived generalization bounds for linear predictors with
an $\ell_2$-norm constraint. In this section we consider the following general $\ell_1$-norm
constraint formulation. Let $\mathcal{H} = \{w : \|w\|_1 \le B\}$ be our hypothesis class, and let
$Z = \mathcal{X} \times \mathcal{Y}$ be the examples domain. Assume that the loss function, $\ell : \mathcal{H} \times Z \to \mathbb{R}$,
is of the same form as in Equation (26.18), with $\phi : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$ being $\rho$-Lipschitz
w.r.t. its first argument. The following theorem bounds the generalization error of
all predictors in $\mathcal{H}$ using their empirical error.


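Theorem 26.15. Suppose that $\mathcal{D}$ is a distribution over $\mathcal{X} \times \mathcal{Y}$ such that with probability
1 we have that $\|x\|_\infty \le R$. Let $\mathcal{H} = \{w \in \mathbb{R}^d : \|w\|_1 \le B\}$ and let $\ell : \mathcal{H} \times Z \to \mathbb{R}$ be a loss
function of the form given in Equation (26.18) such that for all $y \in \mathcal{Y}$, $a \mapsto \phi(a,y)$
is a $\rho$-Lipschitz function and such that $\max_{a \in [-BR,BR]} |\phi(a,y)| \le c$. Then, for any
$\delta \in (0,1)$, with probability of at least $1-\delta$ over the choice of an i.i.d. sample of size $m$,

$$\forall w \in \mathcal{H},\quad L_{\mathcal{D}}(w) \le L_S(w) + 2\rho BR\sqrt{\frac{2\log(2d)}{m}} + c\sqrt{\frac{2\ln(2/\delta)}{m}}.$$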


Proof. The proof is identical to the proof of Theorem 26.12, while relying on
Lemma 26.11 instead of relying on Lemma 26.10. □


It is interesting to compare the two bounds given in Theorem 26.12 and Theorem 26.15.
Apart from the extra $\log(d)$ factor that appears in Theorem 26.15, both
bounds look similar. However, the parameters $B, R$ have different meanings in the
two bounds. In Theorem 26.12, the parameter $B$ imposes an $\ell_2$ constraint on $w$ and
the parameter $R$ captures a low $\ell_2$-norm assumption on the instances. In contrast,
in Theorem 26.15 the parameter $B$ imposes an $\ell_1$ constraint on $w$ (which is stronger
than an $\ell_2$ constraint) while the parameter $R$ captures a low $\ell_\infty$-norm assumption on
the instance (which is weaker than a low $\ell_2$-norm assumption). Therefore, the choice
of the constraint should depend on our prior knowledge of the set of instances and
on prior assumptions on good predictors.
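
The trade-off can be made concrete by comparing the complexity terms of the two bounds. The sketch below assumes these terms have the forms $2\rho BR/\sqrt{m}$ (for the $\ell_2$ setting of Theorem 26.12) and $2\rho BR\sqrt{2\log(2d)/m}$ (for the $\ell_1$ setting of Theorem 26.15). The function names and the example setting (sparse $w$, dense $\pm 1$ instances) are illustrative assumptions.

```python
import math


def l2_complexity_term(B2, R2, m, rho=1.0):
    """2*rho*B*R/sqrt(m), with B an l2 bound on w and R an l2 bound on x."""
    return 2.0 * rho * B2 * R2 / math.sqrt(m)


def l1_complexity_term(B1, Rinf, d, m, rho=1.0):
    """2*rho*B*R*sqrt(2*log(2d)/m), with B an l1 bound on w and R an l-infinity bound on x."""
    return 2.0 * rho * B1 * Rinf * math.sqrt(2.0 * math.log(2 * d) / m)


# Hypothetical setting: x in {-1,+1}^d, and w has s nonzero entries equal to 1.
d, s, m = 10_000, 5, 1_000
l2_term = l2_complexity_term(B2=math.sqrt(s), R2=math.sqrt(d), m=m)
l1_term = l1_complexity_term(B1=float(s), Rinf=1.0, d=d, m=m)
```

In this setting the $\ell_1$ term is roughly ten times smaller, because $\|x\|_2 = \sqrt{d}$ is large while $\|x\|_\infty = 1$. With dense predictors and sparse instances the comparison can go the other way.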


26.5 BIBLIOGRAPHIC REMARKS 


The use of Rademacher complexity for bounding the uniform convergence is due to 
(Koltchinskii & Panchenko 2000, Bartlett & Mendelson 2001, Bartlett & Mendelson 
2002). For additional reading see, for example, (Bousquet 2002, Boucheron, 
Bousquet & Lugosi 2005, Bartlett, Bousquet & Mendelson 2005). Our proof of 
the concentration lemma is due to Kakade and Tewari lecture notes. Kakade, 
Sridharan, and Tewari (2008) gave a unified framework for deriving bounds on the 
Rademacher complexity of linear classes with respect to different assumptions on 
the norms. 
In this chapter we describe another way to measure the complexity of sets, which is
called covering numbers.


27.1 COVERING 


Definition 27.1 (Covering). Let $A \subset \mathbb{R}^m$ be a set of vectors. We say that $A$ is $r$-covered
by a set $A'$, with respect to the Euclidean metric, if for all $a \in A$ there exists
$a' \in A'$ with $\|a - a'\| \le r$. We define by $N(r, A)$ the cardinality of the smallest $A'$ that
$r$-covers $A$.
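
For a finite set of points, an $r$-cover can be built greedily, which gives an upper bound on $N(r, A)$. The sketch below is only an illustration: the function name `greedy_cover` is an assumption, the centers are restricted to points of $A$ itself, and the greedy result need not be a smallest cover.

```python
import math


def greedy_cover(points, r):
    """Return a subset of `points` that r-covers all of them (Euclidean metric)."""
    centers = []
    for p in points:
        # Keep p as a new center only if no existing center is within distance r.
        if not any(math.dist(p, c) <= r for c in centers):
            centers.append(p)
    return centers


# Hypothetical example: an 11 x 11 grid in the unit square.
grid = [(i / 10.0, j / 10.0) for i in range(11) for j in range(11)]
cover = greedy_cover(grid, r=0.25)
```

By construction every point is either a center or within $r$ of an earlier center, so `len(cover)` upper bounds the covering number of the grid at radius $r$.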


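Example 27.1 (Subspace). Suppose that $A \subset \mathbb{R}^m$, let $c = \max_{a \in A} \|a\|$, and assume
that $A$ lies in a $d$-dimensional subspace of $\mathbb{R}^m$. Then, $N(r, A) \le (2c\sqrt{d}/r)^d$. To see
this, let $v_1, \ldots, v_d$ be an orthonormal basis of the subspace. Then, any $a \in A$ can be
written as $a = \sum_{i=1}^d \alpha_i v_i$ with $\|\alpha\|_\infty \le \|\alpha\|_2 = \|a\|_2 \le c$. Let $\epsilon \in \mathbb{R}$ and consider the set

$$A' = \Big\{ \sum_{i=1}^d \alpha'_i v_i \;:\; \forall i,\ \alpha'_i \in \{-c, -c+\epsilon, -c+2\epsilon, \ldots, c\} \Big\}.$$

Given $a \in A$ s.t. $a = \sum_{i=1}^d \alpha_i v_i$ with $\|\alpha\|_\infty \le c$, there exists $a' \in A'$ such that

$$\|a - a'\|^2 = \Big\| \sum_i (\alpha'_i - \alpha_i) v_i \Big\|^2 \le \epsilon^2 \sum_i \|v_i\|^2 \le \epsilon^2 d.$$

Choose $\epsilon = r/\sqrt{d}$; then $\|a - a'\| \le r$ and therefore $A'$ is an $r$-cover of $A$. Hence,

$$N(r, A) \le |A'| = \Big(\frac{2c}{\epsilon}\Big)^d = \Big(\frac{2c\sqrt{d}}{r}\Big)^d.$$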




27.1.1 Properties

The following lemma is immediate from the definition.

Lemma 27.2. For any $A \subset \mathbb{R}^m$, scalar $c > 0$, and vector $a_0 \in \mathbb{R}^m$, we have

$$\forall r > 0,\quad N(r, \{ca + a_0 : a \in A\}) \le N(cr, A).$$

Next, we derive a contraction principle.


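Lemma 27.3. For each $i \in [m]$, let $\phi_i : \mathbb{R} \to \mathbb{R}$ be a $\rho$-Lipschitz function; namely, for
all $\alpha, \beta \in \mathbb{R}$ we have $|\phi_i(\alpha) - \phi_i(\beta)| \le \rho|\alpha - \beta|$. For $a \in \mathbb{R}^m$ let $\phi(a)$ denote the vector
$(\phi_1(a_1), \ldots, \phi_m(a_m))$. Let $\phi \circ A = \{\phi(a) : a \in A\}$. Then,

$$N(\rho r, \phi \circ A) \le N(r, A).$$

Proof. Define $B = \phi \circ A$. Let $A'$ be an $r$-cover of $A$ and define $B' = \phi \circ A'$. Then, for
all $a \in A$ there exists $a' \in A'$ with $\|a - a'\| \le r$. So,

$$\|\phi(a) - \phi(a')\|^2 = \sum_i \big(\phi_i(a_i) - \phi_i(a'_i)\big)^2 \le \rho^2 \sum_i (a_i - a'_i)^2 \le (\rho r)^2.$$

Hence, $B'$ is a $(\rho r)$-cover of $B$. □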
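
The key step of the proof, that applying $\rho$-Lipschitz functions coordinatewise stretches Euclidean distances by at most $\rho$, can be checked numerically. The sketch below uses three hypothetical $\rho$-Lipschitz functions chosen only for illustration; none of the names come from the text.

```python
import math


def apply_coordinatewise(phis, a):
    """Return (phi_1(a_1), ..., phi_m(a_m))."""
    return tuple(phi(x) for phi, x in zip(phis, a))


rho = 2.0
# Each function is rho times a 1-Lipschitz function, hence rho-Lipschitz.
phis = [lambda t: rho * abs(t), lambda t: rho * math.sin(t), lambda t: rho * min(t, 1.0)]

a, a_prime = (0.3, -1.2, 2.0), (0.1, -1.0, 0.5)
dist_before = math.dist(a, a_prime)
dist_after = math.dist(apply_coordinatewise(phis, a), apply_coordinatewise(phis, a_prime))
```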


27.2 FROM COVERING TO RADEMACHER COMPLEXITY VIA CHAINING


The following lemma bounds the Rademacher complexity of A based on the cov- 
ering numbers N(r,A). This technique is called Chaining and is attributed to 
Dudley. 


Lemma 27.4. Let $c = \min_{\bar{a}} \max_{a \in A} \|a - \bar{a}\|$. Then, for any integer $M > 0$,

$$R(A) \le \frac{c\,2^{-M}}{\sqrt{m}} + \frac{6c}{m} \sum_{k=1}^{M} 2^{-k} \sqrt{\log\big(N(c2^{-k}, A)\big)}.$$


Proof. Let $\bar{a}$ be a minimizer of the objective function given in the definition of $c$.
On the basis of Lemma 26.6, we can analyze the Rademacher complexity assuming
that $\bar{a} = 0$.

Consider the set $B_0 = \{0\}$ and note that it is a $c$-cover of $A$. Let $B_1, \ldots, B_M$
be sets such that each $B_k$ corresponds to a minimal $(c2^{-k})$-cover of $A$. Let $a^* =
\mathrm{argmax}_{a \in A} \langle \sigma, a \rangle$ (where if there is more than one maximizer, choose one in an
arbitrary way, and if a maximizer does not exist, choose $a^*$ such that $\langle \sigma, a^* \rangle$ is close
enough to the supremum). Note that $a^*$ is a function of $\sigma$. For each $k$, let $b^{(k)}$ be the
nearest neighbor of $a^*$ in $B_k$ (hence $b^{(k)}$ is also a function of $\sigma$). Using the triangle
inequality,

$$\|b^{(k)} - b^{(k-1)}\| \le \|b^{(k)} - a^*\| + \|a^* - b^{(k-1)}\| \le c\big(2^{-k} + 2^{-(k-1)}\big) = 3c\,2^{-k}.$$

For each $k$ define the set

$$\hat{B}_k = \{(a - a') : a \in B_k,\ a' \in B_{k-1},\ \|a - a'\| \le 3c\,2^{-k}\}.$$

We can now write

$$\begin{aligned}
R(A) &= \frac{1}{m}\,\mathbb{E}\,\langle \sigma, a^* \rangle \\
&= \frac{1}{m}\,\mathbb{E}\Big[\langle \sigma, a^* - b^{(M)} \rangle + \sum_{k=1}^{M} \langle \sigma, b^{(k)} - b^{(k-1)} \rangle\Big] \\
&\le \frac{1}{m}\,\mathbb{E}\Big[\|\sigma\|\,\|a^* - b^{(M)}\|\Big] + \sum_{k=1}^{M} \frac{1}{m}\,\mathbb{E}\Big[\sup_{a \in \hat{B}_k} \langle \sigma, a \rangle\Big].
\end{aligned}$$

Since $\|\sigma\| = \sqrt{m}$ and $\|a^* - b^{(M)}\| \le c2^{-M}$, the first summand is at most $\frac{c\,2^{-M}}{\sqrt{m}}$.
Additionally, by Massart lemma,

$$\frac{1}{m}\,\mathbb{E} \sup_{a \in \hat{B}_k} \langle \sigma, a \rangle \le 3c\,2^{-k}\,\frac{\sqrt{2\log\big(N(c2^{-k}, A)^2\big)}}{m} = 6c\,2^{-k}\,\frac{\sqrt{\log\big(N(c2^{-k}, A)\big)}}{m}.$$

Therefore,

$$R(A) \le \frac{c\,2^{-M}}{\sqrt{m}} + \frac{6c}{m} \sum_{k=1}^{M} 2^{-k} \sqrt{\log\big(N(c2^{-k}, A)\big)}.$$


As a corollary we obtain the following: 


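Lemma 27.5. Assume that there are $\alpha, \beta > 0$ such that for any $k \ge 1$ we have

$$\sqrt{\log\big(N(c2^{-k}, A)\big)} \le \alpha + \beta k.$$

Then,

$$R(A) \le \frac{6c}{m}(\alpha + 2\beta).$$

Proof. The bound follows from Lemma 27.4 by taking $M \to \infty$ and noting that
$\sum_{k=1}^{\infty} 2^{-k} = 1$ and $\sum_{k=1}^{\infty} k\,2^{-k} = 2$. □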
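
The two series identities used in the proof, $\sum_{k \ge 1} 2^{-k} = 1$ and $\sum_{k \ge 1} k2^{-k} = 2$, can be confirmed numerically. The sketch below (the helper name `chaining_sum` and the values of `alpha` and `beta` are arbitrary assumptions) evaluates the truncated sum from Lemma 27.4 under the assumption $\sqrt{\log N(c2^{-k}, A)} \le \alpha + \beta k$ and compares it with the limit $\alpha + 2\beta$.

```python
def chaining_sum(alpha, beta, M):
    """sum_{k=1}^{M} 2^{-k} * (alpha + beta*k), the sum bounded in Lemma 27.5."""
    return sum(2.0 ** (-k) * (alpha + beta * k) for k in range(1, M + 1))


alpha, beta = 1.5, 0.7
partial = chaining_sum(alpha, beta, M=60)
limit = alpha + 2.0 * beta  # uses sum 2^{-k} = 1 and sum k*2^{-k} = 2
```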
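Example 27.2. Consider a set $A$ which lies in a $d$ dimensional subspace of $\mathbb{R}^m$ and
such that $c = \max_{a \in A} \|a\|$. We have shown that $N(r, A) \le \big(\frac{2c\sqrt{d}}{r}\big)^d$. Therefore, for
any $k$,

$$\begin{aligned}
\sqrt{\log\big(N(c2^{-k}, A)\big)} &\le \sqrt{d \log\big(2^{k+1}\sqrt{d}\big)} \\
&\le \sqrt{d \log(2\sqrt{d})} + \sqrt{kd} \\
&\le \sqrt{d \log(2\sqrt{d})} + \sqrt{d}\,k.
\end{aligned}$$

Hence Lemma 27.5 yields

$$R(A) \le \frac{6c}{m}\Big(\sqrt{d \log(2\sqrt{d})} + 2\sqrt{d}\Big) = O\Big(\frac{c\sqrt{d\log(d)}}{m}\Big).$$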
27.3 BIBLIOGRAPHIC REMARKS


The chaining technique is due to Dudley (1987). For an extensive study of covering
numbers as well as other complexity measures that can be used to bound the rate of
uniform convergence we refer the reader to (Anthony & Bartlett 1999).
Proof of the Fundamental Theorem
of Learning Theory


In this chapter we prove Theorem 6.8 from Chapter 6. We remind the reader of the
conditions of the theorem, which will hold throughout this chapter: $\mathcal{H}$ is a hypothesis
class of functions from a domain $\mathcal{X}$ to $\{0,1\}$, the loss function is the 0-1 loss, and
$\mathrm{VCdim}(\mathcal{H}) = d < \infty$.

We shall prove the upper bound for both the realizable and agnostic cases and 
shall prove the lower bound for the agnostic case. The lower bound for the realizable 
case is left as an exercise. 


28.1 THE UPPER BOUND FOR THE AGNOSTIC CASE


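For the upper bound we need to prove that there exists $C$ such that $\mathcal{H}$ is agnostic
PAC learnable with sample complexity

$$m_{\mathcal{H}}(\epsilon, \delta) \le C\,\frac{d + \ln(1/\delta)}{\epsilon^2}.$$

We will prove the slightly looser bound:

$$m_{\mathcal{H}}(\epsilon, \delta) \le C\,\frac{d\log(d/\epsilon) + \ln(1/\delta)}{\epsilon^2}. \qquad (28.1)$$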


The tighter bound in the theorem statement requires a more involved proof, in 
which a more careful analysis of the Rademacher complexity using a technique 
called “chaining” should be used. This is beyond the scope of this book. 
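To prove Equation (28.1), it suffices to show that applying the ERM with a
sample size

$$m \ge 4\,\frac{32d}{\epsilon^2} \cdot \log\Big(\frac{64d}{\epsilon^2}\Big) + \frac{8}{\epsilon^2} \cdot \big(8d\log(e/d) + 2\log(4/\delta)\big)$$

yields an $(\epsilon, \delta)$-learner for $\mathcal{H}$. We prove this result on the basis of Theorem 26.5.
Let $(x_1, y_1), \ldots, (x_m, y_m)$ be a classification training set. Recall that the Sauer-Shelah
lemma tells us that if $\mathrm{VCdim}(\mathcal{H}) = d$ then

$$\big|\{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}\}\big| \le \Big(\frac{em}{d}\Big)^d.$$

Denote $A = \{(\mathbb{1}_{[h(x_1) \ne y_1]}, \ldots, \mathbb{1}_{[h(x_m) \ne y_m]}) : h \in \mathcal{H}\}$. This clearly implies that

$$|A| \le \Big(\frac{em}{d}\Big)^d.$$

Combining this with Lemma 26.8 we obtain the following bound on the Rademacher
complexity:

$$R(A) \le \sqrt{\frac{2d\log(em/d)}{m}}.$$

Using Theorem 26.5 we obtain that with probability of at least $1-\delta$, for every $h \in \mathcal{H}$
we have that

$$L_{\mathcal{D}}(h) - L_S(h) \le \sqrt{\frac{8d\log(em/d)}{m}} + \sqrt{\frac{2\log(2/\delta)}{m}}.$$

Repeating the previous argument for minus the zero-one loss and applying the
union bound we obtain that with probability of at least $1-\delta$, for every $h \in \mathcal{H}$ it
holds that

$$\begin{aligned}
|L_{\mathcal{D}}(h) - L_S(h)| &\le \sqrt{\frac{8d\log(em/d)}{m}} + \sqrt{\frac{2\log(4/\delta)}{m}} \\
&\le 2\sqrt{\frac{8d\log(em/d) + 2\log(4/\delta)}{m}}.
\end{aligned}$$

To ensure that this is smaller than $\epsilon$ we need

$$m \ge \frac{4}{\epsilon^2} \cdot \big(8d\log(m) + 8d\log(e/d) + 2\log(4/\delta)\big).$$

Using Lemma A.2, a sufficient condition for the inequality to hold is that

$$m \ge 4\,\frac{32d}{\epsilon^2} \cdot \log\Big(\frac{64d}{\epsilon^2}\Big) + \frac{8}{\epsilon^2} \cdot \big(8d\log(e/d) + 2\log(4/\delta)\big).$$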
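
The final step can be sanity-checked numerically: compute the sample size $m = 4\frac{32d}{\epsilon^2}\log\frac{64d}{\epsilon^2} + \frac{8}{\epsilon^2}(8d\log(e/d) + 2\log(4/\delta))$ given by the sufficient condition and confirm that it satisfies the implicit requirement $m \ge \frac{4}{\epsilon^2}(8d\log m + 8d\log(e/d) + 2\log(4/\delta))$. This is a sketch; the function names are illustrative assumptions, and a check at a few parameter values is not a proof.

```python
import math


def agnostic_sample_size(d, eps, delta):
    """Sample size from the sufficient condition obtained via Lemma A.2."""
    a = 32.0 * d / eps ** 2
    b_twice = (8.0 / eps ** 2) * (8.0 * d * math.log(math.e / d) + 2.0 * math.log(4.0 / delta))
    return math.ceil(4.0 * a * math.log(2.0 * a) + b_twice)


def satisfies_requirement(m, d, eps, delta):
    """Check m >= (4/eps^2) * (8d log(m) + 8d log(e/d) + 2 log(4/delta))."""
    rhs = (4.0 / eps ** 2) * (8.0 * d * math.log(m)
                              + 8.0 * d * math.log(math.e / d)
                              + 2.0 * math.log(4.0 / delta))
    return m >= rhs


m = agnostic_sample_size(d=10, eps=0.1, delta=0.05)
```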


28.2 THE LOWER BOUND FOR THE AGNOSTIC CASE


Here, we prove that there exists $C$ such that $\mathcal{H}$ is agnostic PAC learnable with
sample complexity

$$m_{\mathcal{H}}(\epsilon, \delta) \ge C\,\frac{d + \ln(1/\delta)}{\epsilon^2}.$$

We will prove the lower bound in two parts. First, we will show that $m(\epsilon, \delta) \ge
0.5\log(1/(4\delta))/\epsilon^2$, and second we will show that for every $\delta \le 1/8$ we have that
$m(\epsilon, \delta) \ge 8d/\epsilon^2$. These two bounds will conclude the proof.


28.2.1 Showing That $m(\epsilon, \delta) \ge 0.5\log(1/(4\delta))/\epsilon^2$


We first show that for any $\epsilon < 1/\sqrt{2}$ and any $\delta \in (0,1)$, we have that $m(\epsilon, \delta) \ge
0.5\log(1/(4\delta))/\epsilon^2$. To do so, we show that for $m \le 0.5\log(1/(4\delta))/\epsilon^2$, $\mathcal{H}$ is not
learnable.

Choose one example that is shattered by $\mathcal{H}$. That is, let $c$ be an example such that
there are $h_+, h_- \in \mathcal{H}$ for which $h_+(c) = 1$ and $h_-(c) = -1$. Define two distributions,
$\mathcal{D}_+$ and $\mathcal{D}_-$, such that for $b \in \{\pm 1\}$ we have

$$\mathcal{D}_b(\{(x, y)\}) = \begin{cases} \dfrac{1 + yb\epsilon}{2} & \text{if } x = c \\ 0 & \text{otherwise.} \end{cases}$$

That is, all the distribution mass is concentrated on two examples $(c, 1)$ and $(c, -1)$,
where the probability of $(c, b)$ is $\frac{1+\epsilon}{2}$ and the probability of $(c, -b)$ is $\frac{1-\epsilon}{2}$.

Let $A$ be an arbitrary algorithm. Any training set sampled from $\mathcal{D}_b$ has the
form $S = (c, y_1), \ldots, (c, y_m)$. Therefore, it is fully characterized by the vector $y =
(y_1, \ldots, y_m) \in \{\pm 1\}^m$. Upon receiving a training set $S$, the algorithm $A$ returns a
hypothesis $h : \mathcal{X} \to \{\pm 1\}$. Since the error of $A$ w.r.t. $\mathcal{D}_b$ only depends on $h(c)$, we
can think of $A$ as a mapping from $\{\pm 1\}^m$ into $\{\pm 1\}$. Therefore, we denote by $A(y)$
the value in $\{\pm 1\}$ corresponding to the prediction of $h(c)$, where $h$ is the hypothesis
that $A$ outputs upon receiving the training set $S = (c, y_1), \ldots, (c, y_m)$.

Note that for any hypothesis $h$ we have

$$L_{\mathcal{D}_b}(h) = \frac{1 - h(c)\,b\,\epsilon}{2}.$$

In particular, the Bayes optimal hypothesis is $h_b$ and

$$L_{\mathcal{D}_b}(A(y)) - L_{\mathcal{D}_b}(h_b) = \frac{1 - A(y)\,b\,\epsilon}{2} - \frac{1 - \epsilon}{2} = \begin{cases} \epsilon & \text{if } A(y) \ne b \\ 0 & \text{otherwise.} \end{cases}$$


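Fix $A$. For $b \in \{\pm 1\}$, let $Y^b = \{y \in \{\pm 1\}^m : A(y) \ne b\}$. The distribution $\mathcal{D}_b$ induces
a probability $P_b$ over $\{\pm 1\}^m$. Hence,

$$\mathbb{P}\big[L_{\mathcal{D}_b}(A(y)) - L_{\mathcal{D}_b}(h_b) = \epsilon\big] = \mathcal{D}_b(Y^b) = \sum_y P_b[y]\,\mathbb{1}_{[A(y) \ne b]}.$$

Denote $N^+ = \{y : |\{i : y_i = 1\}| \ge m/2\}$ and $N^- = \{\pm 1\}^m \setminus N^+$. Note that for any $y \in N^+$
we have $P_+[y] \ge P_-[y]$ and for any $y \in N^-$ we have $P_-[y] \ge P_+[y]$. Therefore,

$$\begin{aligned}
&\max_{b \in \{\pm 1\}} \mathbb{P}\big[L_{\mathcal{D}_b}(A(y)) - L_{\mathcal{D}_b}(h_b) = \epsilon\big] \\
&\quad= \max_{b \in \{\pm 1\}} \sum_y P_b[y]\,\mathbb{1}_{[A(y) \ne b]} \\
&\quad\ge \frac{1}{2}\sum_y P_+[y]\,\mathbb{1}_{[A(y) \ne +]} + \frac{1}{2}\sum_y P_-[y]\,\mathbb{1}_{[A(y) \ne -]} \\
&\quad= \frac{1}{2}\sum_{y \in N^+} \big(P_+[y]\,\mathbb{1}_{[A(y) \ne +]} + P_-[y]\,\mathbb{1}_{[A(y) \ne -]}\big)
 + \frac{1}{2}\sum_{y \in N^-} \big(P_+[y]\,\mathbb{1}_{[A(y) \ne +]} + P_-[y]\,\mathbb{1}_{[A(y) \ne -]}\big) \\
&\quad\ge \frac{1}{2}\sum_{y \in N^+} \big(P_-[y]\,\mathbb{1}_{[A(y) \ne +]} + P_-[y]\,\mathbb{1}_{[A(y) \ne -]}\big)
 + \frac{1}{2}\sum_{y \in N^-} \big(P_+[y]\,\mathbb{1}_{[A(y) \ne +]} + P_+[y]\,\mathbb{1}_{[A(y) \ne -]}\big) \\
&\quad= \frac{1}{2}\sum_{y \in N^+} P_-[y] + \frac{1}{2}\sum_{y \in N^-} P_+[y].
\end{aligned}$$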
yeNt yeN- 


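Next note that $\sum_{y \in N^+} P_-[y] = \sum_{y \in N^-} P_+[y]$, and both values are the probability that
a Binomial$(m, (1-\epsilon)/2)$ random variable will have value greater than $m/2$. Using
Lemma B.11, this probability is lower bounded by

$$\frac{1}{2}\Big(1 - \sqrt{1 - \exp\big(-m\epsilon^2/(1-\epsilon^2)\big)}\Big) \ge \frac{1}{2}\Big(1 - \sqrt{1 - \exp(-2m\epsilon^2)}\Big),$$

where we used the assumption that $\epsilon^2 \le 1/2$. It follows that if $m \le 0.5\log(1/(4\delta))/\epsilon^2$
then there exists $b$ such that

$$\mathbb{P}\big[L_{\mathcal{D}_b}(A(y)) - L_{\mathcal{D}_b}(h_b) = \epsilon\big] \ge \frac{1}{2}\Big(1 - \sqrt{1 - \sqrt{4\delta}}\Big) \ge \delta,$$

where the last inequality follows by standard algebraic manipulations. This concludes
our proof.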
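
The binomial tail that drives this argument can be computed exactly for small $m$ and compared with the lower bound $\frac{1}{2}(1 - \sqrt{1 - \exp(-2m\epsilon^2)})$. The sketch below is illustrative only: the function names are assumptions, and the tail is taken as the event $X \ge m/2$, which is an assumption about how Lemma B.11 is stated.

```python
import math


def binomial_tail_at_least_half(m, p):
    """Exact P[X >= m/2] for X ~ Binomial(m, p)."""
    start = math.ceil(m / 2)
    return sum(math.comb(m, k) * p ** k * (1 - p) ** (m - k) for k in range(start, m + 1))


def tail_lower_bound(m, eps):
    """The lower bound 0.5 * (1 - sqrt(1 - exp(-2 * m * eps^2))) used in the proof."""
    return 0.5 * (1.0 - math.sqrt(1.0 - math.exp(-2.0 * m * eps ** 2)))


eps, delta = 0.1, 0.05
# Largest sample size covered by the claim m <= 0.5 * log(1/(4*delta)) / eps^2.
m = math.floor(0.5 * math.log(1.0 / (4.0 * delta)) / eps ** 2)
exact = binomial_tail_at_least_half(m, (1.0 - eps) / 2.0)
bound = tail_lower_bound(m, eps)
```

For these values the exact tail exceeds the analytic lower bound, which in turn exceeds $\delta$, so no algorithm can reach confidence $1-\delta$ at accuracy $\epsilon$ with this few examples.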


28.2.2 Showing That $m(\epsilon, 1/8) \ge 8d/\epsilon^2$


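We shall now prove that for every $\epsilon < 1/(8\sqrt{2})$ we have that $m(\epsilon, \delta) \ge \frac{8d}{\epsilon^2}$.
Let $\rho = 8\epsilon$ and note that $\rho \in (0, 1/\sqrt{2})$. We will construct a family of distributions
as follows. First, let $C = \{c_1, \ldots, c_d\}$ be a set of $d$ instances which are shattered by $\mathcal{H}$.
Second, for each vector $(b_1, \ldots, b_d) \in \{\pm 1\}^d$, define a distribution $\mathcal{D}_b$ such that

$$\mathcal{D}_b(\{(x, y)\}) = \begin{cases} \dfrac{1}{d} \cdot \dfrac{1 + y\,b_i\,\rho}{2} & \text{if } \exists i : x = c_i \\ 0 & \text{otherwise.} \end{cases}$$

That is, to sample an example according to $\mathcal{D}_b$, we first sample an element $c_i \in C$
uniformly at random, and then set the label to be $b_i$ with probability $(1+\rho)/2$ or
$-b_i$ with probability $(1-\rho)/2$.

It is easy to verify that the Bayes optimal predictor for $\mathcal{D}_b$ is the hypothesis $h \in \mathcal{H}$
such that $h(c_i) = b_i$ for all $i \in [d]$, and its error is $\frac{1-\rho}{2}$. In addition, for any other
function $f : \mathcal{X} \to \{\pm 1\}$, it is easy to verify that

$$L_{\mathcal{D}_b}(f) = \frac{1+\rho}{2} \cdot \frac{|\{i \in [d] : f(c_i) \ne b_i\}|}{d} + \frac{1-\rho}{2} \cdot \frac{|\{i \in [d] : f(c_i) = b_i\}|}{d}.$$

Therefore,

$$L_{\mathcal{D}_b}(f) - \min_{h \in \mathcal{H}} L_{\mathcal{D}_b}(h) = \rho \cdot \frac{|\{i \in [d] : f(c_i) \ne b_i\}|}{d}. \qquad (28.2)$$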
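Next, fix some learning algorithm $A$. As in the proof of the No-Free-Lunch
theorem, we have that

$$\begin{aligned}
&\max_{\mathcal{D}_b : b \in \{\pm 1\}^d}\ \mathbb{E}_{S \sim \mathcal{D}_b^m}\Big[L_{\mathcal{D}_b}(A(S)) - \min_{h \in \mathcal{H}} L_{\mathcal{D}_b}(h)\Big] && (28.3) \\
&\quad\ge \mathbb{E}_{\mathcal{D}_b : b \sim U(\{\pm 1\}^d)}\ \mathbb{E}_{S \sim \mathcal{D}_b^m}\Big[L_{\mathcal{D}_b}(A(S)) - \min_{h \in \mathcal{H}} L_{\mathcal{D}_b}(h)\Big] && (28.4) \\
&\quad= \mathbb{E}_{\mathcal{D}_b : b \sim U(\{\pm 1\}^d)}\ \mathbb{E}_{S \sim \mathcal{D}_b^m}\Big[\rho \cdot \frac{|\{i \in [d] : A(S)(c_i) \ne b_i\}|}{d}\Big] && (28.5) \\
&\quad= \frac{\rho}{d} \sum_{i=1}^{d}\ \mathbb{E}_{\mathcal{D}_b : b \sim U(\{\pm 1\}^d)}\ \mathbb{E}_{S \sim \mathcal{D}_b^m}\ \mathbb{1}_{[A(S)(c_i) \ne b_i]}, && (28.6)
\end{aligned}$$

where the first equality follows from Equation (28.2). In addition, using the definition
of $\mathcal{D}_b$, to sample $S \sim \mathcal{D}_b$ we can first sample $(j_1, \ldots, j_m) \sim U([d])^m$, set $x_r = c_{j_r}$,
and finally sample $y_r$ such that $\mathbb{P}[y_r = b_{j_r}] = (1+\rho)/2$. Let us simplify the notation
and use $y \sim b$ to denote sampling according to $\mathbb{P}[y = b] = (1+\rho)/2$. Therefore, the
right-hand side of Equation (28.6) equals

$$\frac{\rho}{d} \sum_{i=1}^{d}\ \mathbb{E}_{j \sim U([d])^m}\ \mathbb{E}_{b \sim U(\{\pm 1\}^d)}\ \mathbb{E}_{\forall r,\, y_r \sim b_{j_r}}\ \mathbb{1}_{[A(S)(c_i) \ne b_i]}. \qquad (28.7)$$

We now proceed in two steps. First, we show that among all learning algorithms,
$A$, the one which minimizes Equation (28.7) (and hence also Equation (28.4)) is the
Maximum-Likelihood learning rule, denoted $A_{ML}$. Formally, for each $i$, $A_{ML}(S)(c_i)$
is the majority vote among the set $\{y_r : r \in [m], x_r = c_i\}$. Second, we lower bound
Equation (28.7) for $A_{ML}$.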


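Lemma 28.1. Among all algorithms, Equation (28.4) is minimized for $A$ being the
Maximum-Likelihood algorithm, $A_{ML}$, defined as

$$\forall i,\quad A_{ML}(S)(c_i) = \mathrm{sign}\Big(\sum_{r : x_r = c_i} y_r\Big).$$

Proof. Fix some $j \in [d]^m$. Note that given $j$ and $y \in \{\pm 1\}^m$, the training set $S$ is fully
determined. Therefore, we can write $A(j, y)$ instead of $A(S)$. Let us also fix $i \in [d]$.
Denote $b^{\neg i}$ the sequence $(b_1, \ldots, b_{i-1}, b_{i+1}, \ldots, b_d)$. Also, for any $y \in \{\pm 1\}^m$, let $y^I$
denote the elements of $y$ corresponding to indices for which $j_r = i$ and let $y^{\neg I}$ be
the rest of the elements of $y$. We have

$$\begin{aligned}
&\mathbb{E}_{b \sim U(\{\pm 1\}^d)}\ \mathbb{E}_{\forall r,\, y_r \sim b_{j_r}}\ \mathbb{1}_{[A(S)(c_i) \ne b_i]} \\
&\quad= \frac{1}{2} \sum_{b_i \in \{\pm 1\}}\ \mathbb{E}_{b^{\neg i} \sim U(\{\pm 1\}^{d-1})}\ \sum_{y} P[y \mid b^{\neg i}, b_i]\ \mathbb{1}_{[A(j,y)(c_i) \ne b_i]} \\
&\quad= \mathbb{E}_{b^{\neg i} \sim U(\{\pm 1\}^{d-1})}\ \sum_{y^{\neg I}} P[y^{\neg I} \mid b^{\neg i}]\ \frac{1}{2} \sum_{y^I} \Big( \sum_{b_i \in \{\pm 1\}} P[y^I \mid b_i]\ \mathbb{1}_{[A(j,y)(c_i) \ne b_i]} \Big).
\end{aligned}$$

The sum within the parentheses is minimized when $A(j, y)(c_i)$ is the maximizer of
$P[y^I \mid b_i]$ over $b_i \in \{\pm 1\}$, which is exactly the Maximum-Likelihood rule. Repeating
the same argument for all $i$ we conclude our proof. □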
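
The Maximum-Likelihood rule of Lemma 28.1 is a per-instance majority vote, $A_{ML}(S)(c_i) = \mathrm{sign}(\sum_{r : x_r = c_i} y_r)$. A minimal sketch follows; the function name `ml_rule`, the 0-based instance indices, and the tie-breaking toward $+1$ are assumptions made for illustration, since the text does not specify how ties or unseen instances are handled.

```python
def ml_rule(sample, d):
    """Majority-vote labels for c_0..c_{d-1} from a sample of (index, label) pairs.

    Labels are in {-1, +1}. Ties (including instances that never appear in the
    sample) are broken as +1, an arbitrary choice made for this sketch.
    """
    totals = [0] * d
    for j, y in sample:
        totals[j] += y
    return [1 if t >= 0 else -1 for t in totals]


# Hypothetical sample over d = 4 instances; instance 3 is never observed.
sample = [(0, 1), (0, 1), (0, -1), (1, -1), (1, -1), (2, 1), (2, -1)]
predictions = ml_rule(sample, d=4)
```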


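Fix $i$. For every $j$, let $n_i(j) = |\{t : j_t = i\}|$ be the number of instances in which the
instance is $c_i$. For the Maximum-Likelihood rule, we have that the quantity

$$\mathbb{E}_{b \sim U(\{\pm 1\}^d)}\ \mathbb{E}_{\forall r,\, y_r \sim b_{j_r}}\ \mathbb{1}_{[A_{ML}(S)(c_i) \ne b_i]}$$

is exactly the probability that a binomial $(n_i(j), (1-\rho)/2)$ random variable will be
larger than $n_i(j)/2$. Using Lemma B.11, and the assumption $\rho^2 \le 1/2$, we have that

$$\mathbb{P}[B \ge n_i(j)/2] \ge \frac{1}{2}\Big(1 - \sqrt{1 - e^{-2 n_i(j)\rho^2}}\Big).$$

We have thus shown that

$$\begin{aligned}
&\frac{\rho}{d} \sum_{i=1}^{d}\ \mathbb{E}_{j \sim U([d])^m}\ \mathbb{E}_{b \sim U(\{\pm 1\}^d)}\ \mathbb{E}_{\forall r,\, y_r \sim b_{j_r}}\ \mathbb{1}_{[A(S)(c_i) \ne b_i]} \\
&\quad\ge \frac{\rho}{2d} \sum_{i=1}^{d}\ \mathbb{E}_{j \sim U([d])^m}\Big(1 - \sqrt{1 - e^{-2\rho^2 n_i(j)}}\Big) \\
&\quad\ge \frac{\rho}{2d} \sum_{i=1}^{d}\ \mathbb{E}_{j \sim U([d])^m}\Big(1 - \sqrt{2\rho^2 n_i(j)}\Big),
\end{aligned}$$

where in the last inequality we used the inequality $1 - e^{-a} \le a$.
Since the square root function is concave, we can apply Jensen's inequality to
obtain that the above is lower bounded by

$$\begin{aligned}
&\ge \frac{\rho}{2d} \sum_{i=1}^{d} \Big(1 - \sqrt{2\rho^2\ \mathbb{E}_{j \sim U([d])^m}\, n_i(j)}\Big) \\
&= \frac{\rho}{2d} \sum_{i=1}^{d} \Big(1 - \sqrt{2\rho^2 m/d}\Big) \\
&= \frac{\rho}{2}\Big(1 - \sqrt{2\rho^2 m/d}\Big).
\end{aligned}$$

As long as $m < \frac{d}{8\rho^2}$, this term would be larger than $\rho/4$.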
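In summary, we have shown that if $m < \frac{d}{8\rho^2}$ then for any algorithm there exists a
distribution such that

$$\mathbb{E}_{S \sim \mathcal{D}^m}\Big[L_{\mathcal{D}}(A(S)) - \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)\Big] \ge \rho/4.$$

Finally, let $\Delta = \frac{1}{\rho}\big(L_{\mathcal{D}}(A(S)) - \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)\big)$ and note that $\Delta \in [0,1]$ (see
Equation (28.5)). Therefore, using Lemma B.1, we get that

$$\mathbb{P}\Big[L_{\mathcal{D}}(A(S)) - \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) > \epsilon\Big] = \mathbb{P}\Big[\Delta > \frac{\epsilon}{\rho}\Big] \ge \mathbb{E}[\Delta] - \frac{\epsilon}{\rho} \ge \frac{1}{4} - \frac{\epsilon}{\rho}.$$

Choosing $\rho = 8\epsilon$ we conclude that if $m < \frac{8d}{\epsilon^2}$ then with probability of at least $1/8$ we
will have $L_{\mathcal{D}}(A(S)) - \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) \ge \epsilon$.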


28.3 THE UPPER BOUND FOR THE REALIZABLE CASE 


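Here we prove that there exists $C$ such that $\mathcal{H}$ is PAC learnable with sample
complexity

$$m_{\mathcal{H}}(\epsilon, \delta) \le C\,\frac{d\ln(1/\epsilon) + \ln(1/\delta)}{\epsilon}.$$

We do so by showing that for $m \ge C\,\frac{d\ln(1/\epsilon) + \ln(1/\delta)}{\epsilon}$, $\mathcal{H}$ is learnable using the ERM
rule. We prove this claim on the basis of the notion of $\epsilon$-nets.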


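Definition 28.2 ($\epsilon$-net). Let $\mathcal{X}$ be a domain. $S \subset \mathcal{X}$ is an $\epsilon$-net for $\mathcal{H} \subset 2^{\mathcal{X}}$ with
respect to a distribution $\mathcal{D}$ over $\mathcal{X}$ if

$$\forall h \in \mathcal{H} :\quad \mathcal{D}(h) \ge \epsilon \ \Rightarrow\ h \cap S \ne \emptyset.$$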


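Theorem 28.3. Let $\mathcal{H} \subset 2^{\mathcal{X}}$ with $\mathrm{VCdim}(\mathcal{H}) = d$. Fix $\epsilon \in (0,1)$, $\delta \in (0,1/4)$ and let

$$m \ge \frac{8}{\epsilon}\Big(2d\log\Big(\frac{16e}{\epsilon}\Big) + \log\Big(\frac{2}{\delta}\Big)\Big).$$

Then, with probability of at least $1-\delta$ over a choice of $S \sim \mathcal{D}^m$ we have that $S$ is an
$\epsilon$-net for $\mathcal{H}$.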


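Proof. Let

$$B = \{S \subset \mathcal{X} : |S| = m,\ \exists h \in \mathcal{H},\ \mathcal{D}(h) \ge \epsilon,\ h \cap S = \emptyset\}$$

be the set of sets which are not $\epsilon$-nets. We need to bound $\mathbb{P}[S \in B]$. Define

$$B' = \{(S, T) \subset \mathcal{X} : |S| = |T| = m,\ \exists h \in \mathcal{H},\ \mathcal{D}(h) \ge \epsilon,\ h \cap S = \emptyset,\ |T \cap h| > \tfrac{\epsilon m}{2}\}.$$

Claim 1:

$$\mathbb{P}[S \in B] \le 2\,\mathbb{P}[(S, T) \in B'].$$

Proof of Claim 1: Since $S$ and $T$ are chosen independently we can write

$$\mathbb{P}[(S, T) \in B'] = \mathbb{E}_{(S,T) \sim \mathcal{D}^{2m}}\big[\mathbb{1}_{[(S,T) \in B']}\big] = \mathbb{E}_{S \sim \mathcal{D}^m}\Big[\mathbb{E}_{T \sim \mathcal{D}^m}\big[\mathbb{1}_{[(S,T) \in B']}\big]\Big].$$

Note that $(S, T) \in B'$ implies $S \in B$ and therefore $\mathbb{1}_{[(S,T) \in B']} = \mathbb{1}_{[(S,T) \in B']}\,\mathbb{1}_{[S \in B]}$, which
gives

$$\begin{aligned}
\mathbb{P}[(S, T) \in B'] &= \mathbb{E}_{S \sim \mathcal{D}^m}\ \mathbb{E}_{T \sim \mathcal{D}^m}\ \mathbb{1}_{[(S,T) \in B']}\,\mathbb{1}_{[S \in B]} \\
&= \mathbb{E}_{S \sim \mathcal{D}^m}\ \mathbb{1}_{[S \in B]}\ \mathbb{E}_{T \sim \mathcal{D}^m}\ \mathbb{1}_{[(S,T) \in B']}.
\end{aligned}$$

Fix some $S$. Then, either $\mathbb{1}_{[S \in B]} = 0$ or $S \in B$ and then $\exists h_S$ such that $\mathcal{D}(h_S) \ge \epsilon$ and
$|h_S \cap S| = 0$. It follows that a sufficient condition for $(S, T) \in B'$ is that $|T \cap h_S| > \frac{\epsilon m}{2}$.
Therefore, whenever $S \in B$ we have

$$\mathbb{E}_{T \sim \mathcal{D}^m}\ \mathbb{1}_{[(S,T) \in B']} \ge \mathbb{P}_{T \sim \mathcal{D}^m}\big[|T \cap h_S| > \tfrac{\epsilon m}{2}\big].$$

But, since we now assume $S \in B$ we know that $\mathcal{D}(h_S) = \rho \ge \epsilon$. Therefore, $|T \cap h_S|$
is a binomial random variable with parameters $\rho$ (probability of success for a single
try) and $m$ (number of tries). Chernoff's inequality implies

$$\mathbb{P}\big[|T \cap h_S| \le \tfrac{\rho m}{2}\big] \le e^{-\frac{2}{m\rho}(m\rho - m\rho/2)^2} = e^{-m\rho/2} \le e^{-m\epsilon/2} \le e^{-d\log(1/\delta)/2} = \delta^{d/2} \le 1/2.$$

Thus,

$$\mathbb{P}\big[|T \cap h_S| > \tfrac{\epsilon m}{2}\big] = 1 - \mathbb{P}\big[|T \cap h_S| \le \tfrac{\epsilon m}{2}\big] \ge 1 - \mathbb{P}\big[|T \cap h_S| \le \tfrac{\rho m}{2}\big] \ge 1/2.$$

Combining all the preceding we conclude the proof of Claim 1.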
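Claim 2 (Symmetrization):

$$\mathbb{P}[(S, T) \in B'] \le e^{-\epsilon m/4}\,\tau_{\mathcal{H}}(2m).$$

Proof of Claim 2: To simplify notation, let $\alpha = m\epsilon/2$ and for a sequence $A =
(x_1, \ldots, x_{2m})$ let $A_0 = (x_1, \ldots, x_m)$. Using the definition of $B'$ we get that

$$\begin{aligned}
\mathbb{P}[A \in B'] &= \mathbb{E}_{A \sim \mathcal{D}^{2m}}\ \max_{h \in \mathcal{H}}\ \mathbb{1}_{[\mathcal{D}(h) \ge \epsilon]}\ \mathbb{1}_{[|h \cap A_0| = 0]}\ \mathbb{1}_{[|h \cap A| \ge \alpha]} \\
&\le \mathbb{E}_{A \sim \mathcal{D}^{2m}}\ \max_{h \in \mathcal{H}}\ \mathbb{1}_{[|h \cap A_0| = 0]}\ \mathbb{1}_{[|h \cap A| \ge \alpha]}.
\end{aligned}$$

Now, let us define by $\mathcal{H}_A$ the effective number of different hypotheses on $A$, namely,
$\mathcal{H}_A = \{h \cap A : h \in \mathcal{H}\}$. It follows that

$$\begin{aligned}
\mathbb{P}[A \in B'] &\le \mathbb{E}_{A \sim \mathcal{D}^{2m}}\ \max_{h \in \mathcal{H}_A}\ \mathbb{1}_{[|h \cap A_0| = 0]}\ \mathbb{1}_{[|h \cap A| \ge \alpha]} \\
&\le \mathbb{E}_{A \sim \mathcal{D}^{2m}}\ \sum_{h \in \mathcal{H}_A}\ \mathbb{1}_{[|h \cap A_0| = 0]}\ \mathbb{1}_{[|h \cap A| \ge \alpha]}.
\end{aligned}$$

Let $J = \{j \subset [2m] : |j| = m\}$. For any $j \in J$ and $A = (x_1, \ldots, x_{2m})$ define $A_j =
(x_{j_1}, \ldots, x_{j_m})$. Since the elements of $A$ are chosen i.i.d., we have that for any $j \in J$ and
any function $f(A, A_0)$ it holds that $\mathbb{E}_{A \sim \mathcal{D}^{2m}}[f(A, A_0)] = \mathbb{E}_{A \sim \mathcal{D}^{2m}}[f(A, A_j)]$. Since
this holds for any $j$ it also holds for the expectation of $j$ chosen at random from $J$.
In particular, it holds for the function $f(A, A_0) = \sum_{h \in \mathcal{H}_A} \mathbb{1}_{[|h \cap A_0| = 0]}\ \mathbb{1}_{[|h \cap A| \ge \alpha]}$. We
therefore obtain that

$$\begin{aligned}
\mathbb{P}[A \in B'] &\le \mathbb{E}_{A \sim \mathcal{D}^{2m}}\ \mathbb{E}_{j \sim J}\ \sum_{h \in \mathcal{H}_A}\ \mathbb{1}_{[|h \cap A_j| = 0]}\ \mathbb{1}_{[|h \cap A| \ge \alpha]} \\
&= \mathbb{E}_{A \sim \mathcal{D}^{2m}}\ \sum_{h \in \mathcal{H}_A}\ \mathbb{1}_{[|h \cap A| \ge \alpha]}\ \mathbb{E}_{j \sim J}\ \mathbb{1}_{[|h \cap A_j| = 0]}.
\end{aligned}$$

Now, fix some $A$ s.t. $|h \cap A| \ge \alpha$. Then, $\mathbb{E}_j\, \mathbb{1}_{[|h \cap A_j| = 0]}$ is the probability that when
choosing $m$ balls from a bag with at least $\alpha$ red balls, we will never choose a red ball.
This probability is at most

$$(1 - \alpha/(2m))^m = (1 - \epsilon/4)^m \le e^{-\epsilon m/4}.$$

We therefore get that

$$\mathbb{P}[A \in B'] \le \mathbb{E}_{A \sim \mathcal{D}^{2m}}\ \sum_{h \in \mathcal{H}_A} e^{-\epsilon m/4} \le e^{-\epsilon m/4}\ \mathbb{E}_{A \sim \mathcal{D}^{2m}}\ |\mathcal{H}_A|.$$

Using the definition of the growth function we conclude the proof of Claim 2.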
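
The "balls in a bag" step can be checked exactly for small values. Drawing $m$ of $2m$ positions uniformly without replacement, the probability of missing $\alpha$ marked positions is $\binom{2m-\alpha}{m}/\binom{2m}{m}$, and each factor of the resulting product is at most $1 - \alpha/(2m)$. The sketch below compares the two quantities; the function names are illustrative assumptions.

```python
import math


def prob_no_red(m, alpha):
    """P[a uniformly random m-subset of 2m positions misses alpha marked positions]."""
    return math.comb(2 * m - alpha, m) / math.comb(2 * m, m)


def claimed_upper_bound(m, alpha):
    """The bound (1 - alpha/(2m))^m used in the proof of Claim 2."""
    return (1.0 - alpha / (2.0 * m)) ** m


m, alpha = 10, 3
exact = prob_no_red(m, alpha)
upper = claimed_upper_bound(m, alpha)
```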


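Completing the Proof: By Sauer's lemma we know that $\tau_{\mathcal{H}}(2m) \le (2em/d)^d$.
Combining this with the two claims we obtain that

$$\mathbb{P}[S \in B] \le 2(2em/d)^d\, e^{-\epsilon m/4}.$$

We would like the right-hand side of the inequality to be at most $\delta$; that is,

$$2(2em/d)^d\, e^{-\epsilon m/4} \le \delta.$$

Rearranging, we obtain the requirement

$$m \ge \frac{4}{\epsilon}\big(d\log(2em/d) + \log(2/\delta)\big) = \frac{4d}{\epsilon}\log(m) + \frac{4}{\epsilon}\big(d\log(2e/d) + \log(2/\delta)\big).$$

Using Lemma A.2, a sufficient condition for the preceding to hold is that

$$m \ge \frac{16d}{\epsilon}\log\Big(\frac{8d}{\epsilon}\Big) + \frac{8}{\epsilon}\big(d\log(2e/d) + \log(2/\delta)\big).$$

A sufficient condition for this is that

$$\begin{aligned}
m &\ge \frac{16d}{\epsilon}\log\Big(\frac{8d}{\epsilon}\Big) + \frac{16}{\epsilon}\Big(d\log(2e/d) + \frac{1}{2}\log(2/\delta)\Big) \\
&= \frac{16d}{\epsilon}\log\Big(\frac{8d \cdot 2e}{d\,\epsilon}\Big) + \frac{8}{\epsilon}\log(2/\delta) \\
&= \frac{8}{\epsilon}\Big(2d\log\Big(\frac{16e}{\epsilon}\Big) + \log\Big(\frac{2}{\delta}\Big)\Big),
\end{aligned}$$

and this concludes our proof. □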
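
As a numerical sanity check of the final step, one can compute the sample size $m = \frac{8}{\epsilon}(2d\log(16e/\epsilon) + \log(2/\delta))$ from Theorem 28.3 and verify that it makes the failure bound $2(2em/d)^d e^{-\epsilon m/4}$ at most $\delta$. The sketch below works in log space to avoid overflow; the function names are illustrative assumptions.

```python
import math


def eps_net_sample_size(d, eps, delta):
    """m = ceil((8/eps) * (2d log(16e/eps) + log(2/delta))), as in Theorem 28.3."""
    return math.ceil((8.0 / eps) * (2.0 * d * math.log(16.0 * math.e / eps)
                                    + math.log(2.0 / delta)))


def failure_bound_log(m, d, eps):
    """log of 2 * (2em/d)^d * exp(-eps*m/4), the bound on P[S in B]."""
    return math.log(2.0) + d * math.log(2.0 * math.e * m / d) - eps * m / 4.0


d, eps, delta = 5, 0.1, 0.05
m = eps_net_sample_size(d, eps, delta)
```

For these values the bound is far below $\delta$, which reflects the slack introduced by Lemma A.2 and the subsequent simplifications.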


28.3.1 From $\epsilon$-Nets to PAC Learnability


Theorem 28.4. Let $\mathcal{H}$ be a hypothesis class over $\mathcal{X}$ with $\mathrm{VCdim}(\mathcal{H}) = d$. Let $\mathcal{D}$ be a
distribution over $\mathcal{X}$ and let $c \in \mathcal{H}$ be a target hypothesis. Fix $\epsilon, \delta \in (0,1)$ and let $m$ be
as defined in Theorem 28.3. Then, with probability of at least $1-\delta$ over a choice of $m$
i.i.d. instances from $\mathcal{X}$ with labels according to $c$ we have that any ERM hypothesis
has a true error of at most $\epsilon$.


Proof. Define the class $\mathcal{H}^c = \{c \,\triangle\, h : h \in \mathcal{H}\}$, where $c \,\triangle\, h = (h \setminus c) \cup (c \setminus h)$. It is
easy to verify that if some $A \subset \mathcal{X}$ is shattered by $\mathcal{H}$ then it is also shattered by $\mathcal{H}^c$
and vice versa. Hence, $\mathrm{VCdim}(\mathcal{H}) = \mathrm{VCdim}(\mathcal{H}^c)$. Therefore, using Theorem 28.3
we know that with probability of at least $1-\delta$, the sample $S$ is an $\epsilon$-net for $\mathcal{H}^c$.
Note that $L_{\mathcal{D}}(h) = \mathcal{D}(h \,\triangle\, c)$. Therefore, for any $h \in \mathcal{H}$ with $L_{\mathcal{D}}(h) \ge \epsilon$ we have that
$|(h \,\triangle\, c) \cap S| > 0$, which implies that $h$ cannot be an ERM hypothesis, which concludes
our proof. □
Multiclass Learnability


In Chapter 17 we have introduced the problem of multiclass categorization, in which
the goal is to learn a predictor $h : \mathcal{X} \to [k]$. In this chapter we address PAC learnability
of multiclass predictors with respect to the 0-1 loss. As in Chapter 6, the main
goal of this chapter is to:


• Characterize which classes of multiclass hypotheses are learnable in the (multiclass)
  PAC model.
• Quantify the sample complexity of such hypothesis classes.


In view of the fundamental theorem of learning theory (Theorem 6.8), it is natu- 
ral to seek a generalization of the VC dimension to multiclass hypothesis classes. 
In Section 29.1 we show such a generalization, called the Natarajan dimension, and 
state a generalization of the fundamental theorem based on the Natarajan dimen- 
sion. Then, we demonstrate how to calculate the Natarajan dimension of several 
important hypothesis classes. 

Recall that the main message of the fundamental theorem of learning theory is 
that a hypothesis class of binary classifiers is learnable (with respect to the 0-1 loss) 
if and only if it has the uniform convergence property, and then it is learnable by 
any ERM learner. In Chapter 13, Exercise 29.2, we have shown that this equiv- 
alence breaks down for a certain convex learning problem. The last section of this 
chapter is devoted to showing that the equivalence between learnability and uniform 
convergence breaks down even in multiclass problems with the 0-1 loss, which are 
very similar to binary classification. Indeed, we construct a hypothesis class which is 
learnable by a specific ERM learner, but for which other ERM learners might fail 
and the uniform convergence property does not hold. 


29.1 THE NATARAJAN DIMENSION 


In this section we define the Natarajan dimension, which is a generalization of the
VC dimension to classes of multiclass predictors. Throughout this section, let $\mathcal{H}$
be a hypothesis class of multiclass predictors; namely, each $h \in \mathcal{H}$ is a function
from $\mathcal{X}$ to $[k]$.
To define the Natarajan dimension, we first generalize the definition of
shattering.


Definition 29.1 (Shattering (Multiclass Version)). We say that a set $C \subset \mathcal{X}$ is
shattered by $\mathcal{H}$ if there exist two functions $f_0, f_1 : C \to [k]$ such that


• For every $x \in C$, $f_0(x) \ne f_1(x)$.
• For every $B \subset C$, there exists a function $h \in \mathcal{H}$ such that

  $\forall x \in B,\ h(x) = f_0(x)$ and $\forall x \in C \setminus B,\ h(x) = f_1(x)$.


Definition 29.2 (Natarajan Dimension). The Natarajan dimension of $\mathcal{H}$, denoted
$\mathrm{Ndim}(\mathcal{H})$, is the maximal size of a shattered set $C \subset \mathcal{X}$.
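
For a finite hypothesis class over a finite domain, the two definitions can be checked by brute force. In the sketch below each hypothesis is represented as a tuple of labels indexed by the domain points $0, \ldots, n-1$; this representation and the function names are assumptions made for illustration. The search uses the fact that $f_0$ and $f_1$ must themselves be restrictions of hypotheses (take $B = C$ and $B = \emptyset$), and it is exponential in $|C|$, so it is only suitable for tiny examples.

```python
from itertools import combinations, product


def is_shattered(hyps, C):
    """Check multiclass shattering of the index set C by the tuples in hyps."""
    restr = {tuple(h[x] for x in C) for h in hyps}
    for f0 in restr:
        for f1 in restr:
            # f0 and f1 must disagree on every point of C.
            if any(a == b for a, b in zip(f0, f1)):
                continue
            # Every mixture "f0 on B, f1 on C minus B" must be realized by some hypothesis.
            if all(tuple(f0[i] if bit else f1[i] for i, bit in enumerate(bits)) in restr
                   for bits in product((0, 1), repeat=len(C))):
                return True
    return False


def natarajan_dim(hyps, domain_size):
    """Largest size of a shattered subset of {0, ..., domain_size - 1}."""
    best = 0
    for size in range(1, domain_size + 1):
        if any(is_shattered(hyps, C) for C in combinations(range(domain_size), size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so larger sizes fail too
    return best


# All functions from 2 points to 3 labels, and the 3 constant functions on 3 points.
all_functions = [(a, b) for a in range(3) for b in range(3)]
constants = [(0, 0, 0), (1, 1, 1), (2, 2, 2)]
```

The class of all functions shatters both points, while the constant functions shatter only single points, since a mixture such as (0, 1) is not a constant function.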


It is not hard to see that in the case that there are exactly two classes,
$\mathrm{Ndim}(\mathcal{H}) = \mathrm{VCdim}(\mathcal{H})$. Therefore, the Natarajan dimension generalizes the VC
dimension. We next show that the Natarajan dimension allows us to generalize the
fundamental theorem of statistical learning from binary classification to multiclass
classification.


29.2 THE MULTICLASS FUNDAMENTAL THEOREM 


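Theorem 29.3 (The Multiclass Fundamental Theorem). There exist absolute constants
$C_1, C_2 > 0$ such that the following holds. For every hypothesis class $\mathcal{H}$ of
functions from $\mathcal{X}$ to $[k]$, such that the Natarajan dimension of $\mathcal{H}$ is $d$, we have


1. $\mathcal{H}$ has the uniform convergence property with sample complexity

$$C_1\,\frac{d + \log(1/\delta)}{\epsilon^2} \le m_{\mathcal{H}}^{UC}(\epsilon, \delta) \le C_2\,\frac{d\log(k) + \log(1/\delta)}{\epsilon^2}.$$

2. $\mathcal{H}$ is agnostic PAC learnable with sample complexity

$$C_1\,\frac{d + \log(1/\delta)}{\epsilon^2} \le m_{\mathcal{H}}(\epsilon, \delta) \le C_2\,\frac{d\log(k) + \log(1/\delta)}{\epsilon^2}.$$

3. $\mathcal{H}$ is PAC learnable (assuming realizability) with sample complexity

$$C_1\,\frac{d + \log(1/\delta)}{\epsilon} \le m_{\mathcal{H}}(\epsilon, \delta) \le C_2\,\frac{d\log\big(\frac{kd}{\epsilon}\big) + \log(1/\delta)}{\epsilon}.$$


29.2.1 On the Proof of Theorem 29.3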


The lower bounds in Theorem 29.3 can be deduced by a reduction from the binary 
fundamental theorem (see Exercise 29.5). 

The upper bounds in Theorem 29.3 can be proved along the same lines of the 
proof of the fundamental theorem for binary classification, given in Chapter 28 
(see Exercise 29.4). The sole ingredient of that proof that should be modified in 
a nonstraightforward manner is Sauer’s lemma. It applies only to binary classes and 
therefore must be replaced. An appropriate substitute is Natarajan’s lemma: 


Lemma 29.4 (Natarajan). $|\mathcal{H}| \le |\mathcal{X}|^{\mathrm{Ndim}(\mathcal{H})} \cdot k^{2\,\mathrm{Ndim}(\mathcal{H})}$.


The proof of Natarajan's lemma shares the same spirit of the proof of Sauer's
lemma and is left as an exercise (see Exercise 29.3).


29.3 CALCULATING THE NATARAJAN DIMENSION 


In this section we show how to calculate (or estimate) the Natarajan dimension of 
several popular classes, some of which were studied in Chapter 17. As these cal- 
culations indicate, the Natarajan dimension is often proportional to the number of 
parameters required to define a hypothesis. 


29.3.1 One-vs.-All Based Classes 


In Chapter 17 we have seen two reductions of multiclass categorization to binary 
classification: One-vs.-All and All-Pairs. In this section we calculate the Natarajan 
dimension of the One-vs.-All method. 

Recall that in One-vs.-All we train, for each label, a binary classifier that distinguishes
between that label and the rest of the labels. This naturally suggests
considering multiclass hypothesis classes of the following form. Let $\mathcal{H}_{bin} \subset \{0,1\}^{\mathcal{X}}$
be a binary hypothesis class. For every $\bar{h} = (h_1, \ldots, h_k) \in (\mathcal{H}_{bin})^k$ define $T(\bar{h}) : \mathcal{X} \to
[k]$ by

$$T(\bar{h})(x) = \mathrm{argmax}_{i \in [k]}\, h_i(x).$$

If there are two labels that maximize $h_i(x)$, we choose the smaller one. Also, let

$$\mathcal{H}_{bin}^{OvA,k} = \{T(\bar{h}) : \bar{h} \in (\mathcal{H}_{bin})^k\}.$$

What "should" be the Natarajan dimension of $\mathcal{H}_{bin}^{OvA,k}$? Intuitively, to specify a
hypothesis in $\mathcal{H}_{bin}$ we need $d = \mathrm{VCdim}(\mathcal{H}_{bin})$ parameters. To specify a hypothesis
in $\mathcal{H}_{bin}^{OvA,k}$, we need to specify $k$ hypotheses in $\mathcal{H}_{bin}$. Therefore, $kd$ parameters
should suffice. The following lemma establishes this intuition.

Lemma 29.5. If $d = \mathrm{VCdim}(\mathcal{H}_{bin})$ then

$$\mathrm{Ndim}(\mathcal{H}_{bin}^{OvA,k}) \le 3kd\log(kd).$$

Proof. Let $C \subset \mathcal{X}$ be a shattered set. By the definition of shattering (for multiclass
hypotheses)

$$\big|(\mathcal{H}_{bin}^{OvA,k})_C\big| \ge 2^{|C|}.$$

On the other hand, each hypothesis in $\mathcal{H}_{bin}^{OvA,k}$ is determined by using $k$ hypotheses
from $\mathcal{H}_{bin}$. Therefore,

$$\big|(\mathcal{H}_{bin}^{OvA,k})_C\big| \le \big|(\mathcal{H}_{bin})_C\big|^k.$$

By Sauer's lemma, $|(\mathcal{H}_{bin})_C| \le |C|^d$. We conclude that

$$2^{|C|} \le \big|(\mathcal{H}_{bin}^{OvA,k})_C\big| \le |C|^{dk}.$$

The proof follows by taking the logarithm and applying Lemma A.1. □
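
The One-vs.-All prediction rule $T(\bar{h})(x) = \mathrm{argmax}_{i \in [k]} h_i(x)$, with ties going to the smaller label, can be written in a few lines. This is a minimal sketch; the function name `one_vs_all_predict` and the threshold classifiers in the example are hypothetical.

```python
def one_vs_all_predict(binary_hypotheses, x):
    """T(h_1, ..., h_k)(x): the label whose binary classifier outputs the largest value.

    Labels are 1..k as in the text. Ties go to the smaller label because
    max() returns the first maximal element it encounters.
    """
    k = len(binary_hypotheses)
    return max(range(1, k + 1), key=lambda i: binary_hypotheses[i - 1](x))


# Hypothetical binary classifiers on the real line (threshold functions into {0, 1}).
hs = [lambda x: int(x < 0), lambda x: int(x >= 0), lambda x: int(x >= 2)]
```

At $x = 3$ both the second and the third classifier output 1, and the tie-breaking convention returns label 2.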
How tight is Lemma 29.5? It is not hard to see that for some classes,
$\mathrm{Ndim}(\mathcal{H}_{bin}^{OvA,k})$ can be much smaller than $dk$ (see Exercise 29.1). However, there
are several natural binary classes, $\mathcal{H}_{bin}$ (e.g., halfspaces), for which
$\mathrm{Ndim}(\mathcal{H}_{bin}^{OvA,k}) = \Omega(dk)$ (see Exercise 29.6).


29.3.2 General Multiclass-to-Binary Reductions 


The same reasoning used to establish Lemma 29.5 can be used to upper bound the 
Natarajan dimension of more general multiclass-to-binary reductions. These reduc- 
tions train several binary classifiers on the data. Then, given a new instance, they 
predict its label by using some rule that takes into account the labels predicted by 
the binary classifiers. These reductions include One-vs.-All and All-Pairs. 

Suppose that such a method trains $l$ binary classifiers from a binary class $\mathcal{H}_{bin}$,
and $r : \{0,1\}^l \to [k]$ is the rule that determines the (multiclass) label according to
the predictions of the binary classifiers. The hypothesis class corresponding to this
method can be defined as follows. For every $\bar{h} = (h_1, \ldots, h_l) \in (\mathcal{H}_{bin})^l$ define
$R(\bar{h}) : \mathcal{X} \to [k]$ by

$$R(\bar{h})(x) = r(h_1(x), \ldots, h_l(x)).$$

Finally, let

$$\mathcal{H}_{bin}^{r} = \{R(\bar{h}) : \bar{h} \in (\mathcal{H}_{bin})^l\}.$$

Similarly to Lemma 29.5 it can be proven that:

Lemma 29.6. If $d = \mathrm{VCdim}(\mathcal{H}_{bin})$ then

$$\mathrm{Ndim}(\mathcal{H}_{bin}^{r}) \le 3\,l\,d\log(l\,d).$$

The proof is left as Exercise 29.2.


29.3.3 Linear Multiclass Predictors 


Next, we consider the class of linear multiclass predictors (see Section 17.2). Let
$\Psi : \mathcal{X} \times [k] \to \mathbb{R}^d$ be some class-sensitive feature mapping and let

$$\mathcal{H}_\Psi = \Big\{ x \mapsto \mathrm{argmax}_{i \in [k]} \langle w, \Psi(x, i) \rangle \;:\; w \in \mathbb{R}^d \Big\}. \qquad (29.1)$$

Each hypothesis in $\mathcal{H}_\Psi$ is determined by $d$ parameters, namely, a vector $w \in \mathbb{R}^d$.
Therefore, we would expect that the Natarajan dimension would be upper bounded
by $d$. Indeed:


Theorem 29.7. $\mathrm{Ndim}(\mathcal{H}_\Psi) \le d$.


Proof. Let $C \subset \mathcal{X}$ be a shattered set, and let $f_0, f_1 : C \to [k]$ be the two functions
that witness the shattering. We need to show that $|C| \le d$. For every $x \in C$ let $\rho(x) =
\Psi(x, f_0(x)) - \Psi(x, f_1(x))$. We claim that the set $\rho(C) = \{\rho(x) : x \in C\}$ consists of
$|C|$ elements (i.e., $\rho$ is one to one) and is shattered by the binary hypothesis class of
homogeneous linear separators on $\mathbb{R}^d$,

$$\mathcal{H} = \{x \mapsto \mathrm{sign}(\langle w, x \rangle) : w \in \mathbb{R}^d\}.$$

Since $\mathrm{VCdim}(\mathcal{H}) = d$, it will follow that $|C| = |\rho(C)| \le d$, as required.
To establish our claim it is enough to show that $|\mathcal{H}_{\rho(C)}| = 2^{|C|}$. Indeed, given a
subset $B \subset C$, by the definition of shattering, there exists $h_B \in \mathcal{H}_\Psi$ for which

$$\forall x \in B,\ h_B(x) = f_0(x) \quad\text{and}\quad \forall x \in C \setminus B,\ h_B(x) = f_1(x).$$

Let $w_B \in \mathbb{R}^d$ be a vector that defines $h_B$. We have that, for every $x \in B$,

$$\langle w_B, \Psi(x, f_0(x)) \rangle > \langle w_B, \Psi(x, f_1(x)) \rangle \ \Rightarrow\ \langle w_B, \rho(x) \rangle > 0.$$

Similarly, for every $x \in C \setminus B$,

$$\langle w_B, \rho(x) \rangle < 0.$$

It follows that the hypothesis $g_B \in \mathcal{H}$ defined by the same $w_B \in \mathbb{R}^d$ labels the points in
$\rho(B)$ by 1 and the points in $\rho(C \setminus B)$ by 0. Since this holds for every $B \subset C$ we obtain
that $|C| = |\rho(C)|$ and $|\mathcal{H}_{\rho(C)}| = 2^{|C|}$, which concludes our proof. □


The theorem is tight in the sense that there are mappings $\Psi$ for which 
$\mathrm{Ndim}(\mathcal{H}_\Psi) = \Omega(d)$. For example, this is true for the multivector construction 
(see Section 17.2 and the Bibliographic Remarks at the end of this chapter). We 
therefore conclude: 


Corollary 29.8. Let $\mathcal{X} = \mathbb{R}^n$ and let $\Psi : \mathcal{X} \times [k] \to \mathbb{R}^{nk}$ be the class sensitive feature 
mapping for the multi-vector construction: 

$$\Psi(x, y) = [\,\underbrace{0,\ldots,0}_{\in \mathbb{R}^{(y-1)n}},\ \underbrace{x_1,\ldots,x_n}_{\in \mathbb{R}^{n}},\ \underbrace{0,\ldots,0}_{\in \mathbb{R}^{(k-y)n}}\,].$$

Let $\mathcal{H}_\Psi$ be as defined in Equation (29.1). Then, the Natarajan dimension of $\mathcal{H}_\Psi$ 
satisfies 

$$(k-1)(n-1) \le \mathrm{Ndim}(\mathcal{H}_\Psi) \le kn.$$
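
As a small illustration of the multi-vector construction, the following Python sketch builds the feature vector for a pair (x, y) and evaluates the linear multiclass predictor of Equation (29.1). The function names `psi` and `predict`, the 0-based labels, and the tie-breaking rule (smallest label wins) are assumptions made here for illustration; they are not part of the text.

```python
from typing import List, Sequence


def psi(x: Sequence[float], y: int, k: int) -> List[float]:
    """Multi-vector feature map: x is copied into block y of a vector in R^{nk}.

    Labels are 0-based here (y in {0, ..., k-1}); the text uses [k] = {1, ..., k}.
    """
    n = len(x)
    out = [0.0] * (n * k)
    out[y * n:(y + 1) * n] = [float(v) for v in x]
    return out


def predict(w: Sequence[float], x: Sequence[float], k: int) -> int:
    """Return argmax_i <w, psi(x, i)>; ties go to the smallest label (an assumption)."""
    scores = [sum(wi * fi for wi, fi in zip(w, psi(x, i, k))) for i in range(k)]
    return max(range(k), key=lambda i: scores[i])


# Example with n = 2 features and k = 3 classes, so w lives in R^6.
w = [1.0, 0.0, 0.0, 1.0, -1.0, -1.0]
print(psi([2.0, 5.0], 1, 3))      # [0.0, 0.0, 2.0, 5.0, 0.0, 0.0]
print(predict(w, [2.0, 5.0], 3))  # 1, since the three scores are 2, 5, -7
```

With this map the vector w is just k weight vectors of dimension n laid side by side, which is why the number of parameters, and hence the upper bound of Theorem 29.7, equals kn.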


29.4 ON GOOD AND BAD ERMS 


In this section we present an example of a hypothesis class with the property that not 
all ERMs for the class are equally successful. Furthermore, if we allow an infinite 
number of labels, we will also obtain an example of a class that is learnable by some 
ERM, but other ERMs will fail to learn it. Clearly, this also implies that the class is 
learnable but it does not have the uniform convergence property. For simplicity, we 
consider only the realizable case. 

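The class we consider is defined as follows. The instance space $\mathcal{X}$ will be any 
finite or countable set. Let $P_f(\mathcal{X})$ be the collection of all finite and cofinite subsets 
of $\mathcal{X}$ (that is, for each $A \in P_f(\mathcal{X})$, either $A$ or $\mathcal{X} \setminus A$ must be finite). Instead of $[k]$, 
the label set is $\mathcal{Y} = P_f(\mathcal{X}) \cup \{*\}$, where $*$ is some special label. For every $A \in P_f(\mathcal{X})$ 
define $h_A : \mathcal{X} \to \mathcal{Y}$ by 

$$h_A(x) = \begin{cases} A & x \in A \\ * & x \notin A \end{cases}$$

Finally, the hypothesis class we take is 

$$\mathcal{H} = \{h_A : A \in P_f(\mathcal{X})\}.$$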
Let $\mathcal{A}$ be some ERM algorithm for $\mathcal{H}$. Assume that $\mathcal{A}$ operates on a sample labeled 
by $h_A \in \mathcal{H}$. Since $h_A$ is the only hypothesis in $\mathcal{H}$ that might return the label $A$, if 
$\mathcal{A}$ observes the label $A$, it "knows" that the learned hypothesis is $h_A$, and, as an 
ERM, must return it (note that in this case the error of the returned hypothesis 
is 0). Therefore, to specify an ERM, we should only specify the hypothesis it returns 
upon receiving a sample of the form 

$$S = \{(x_1, *), \ldots, (x_m, *)\}.$$

We consider two ERMs: The first, $\mathcal{A}_{good}$, is defined by 

$$\mathcal{A}_{good}(S) = h_\emptyset;$$

that is, it outputs the hypothesis which predicts '$*$' for every $x \in \mathcal{X}$. The second 
ERM, $\mathcal{A}_{bad}$, is defined by 

$$\mathcal{A}_{bad}(S) = h_{\{x_1,\ldots,x_m\}^c}.$$

The following claim shows that the sample complexity of $\mathcal{A}_{bad}$ is about $|\mathcal{X}|$-times 
larger than the sample complexity of $\mathcal{A}_{good}$. This establishes a gap between different 
ERMs. If $\mathcal{X}$ is infinite, we even obtain a learnable class that is not learnable by 
every ERM. 
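
To make the two ERMs concrete, here is a minimal Python sketch restricted to a finite instance space. It represents the hypothesis indexed by a set A as a closure, and returns either the "good" or the "bad" ERM output on a sample. The names `h`, `erm`, `true_error`, the string `"*"` standing for the special label, and the uniform distribution in the example are all assumptions made here for illustration.

```python
from fractions import Fraction
from typing import Callable, Dict, FrozenSet, Hashable, List, Tuple, Union

STAR = "*"                                   # the special label
Label = Union[FrozenSet[Hashable], str]
Hypothesis = Callable[[Hashable], Label]


def h(A) -> Hypothesis:
    """The hypothesis h_A: returns the set A on points of A and '*' elsewhere."""
    A = frozenset(A)
    return lambda x: A if x in A else STAR


def erm(sample: List[Tuple[Hashable, Label]], domain, good: bool) -> Hypothesis:
    """A_good (good=True) or A_bad (good=False) on a finite domain."""
    for _, label in sample:
        if label != STAR:                    # the label reveals A, so any ERM returns h_A
            return h(label)
    if good:
        return h(frozenset())                # A_good: predict '*' everywhere
    seen = {x for x, _ in sample}
    return h(frozenset(domain) - seen)       # A_bad: complement of the observed points


def true_error(pred: Hypothesis, target: Hypothesis, dist: Dict[Hashable, Fraction]) -> Fraction:
    """Probability mass of the points on which pred and target disagree."""
    return sum((p for x, p in dist.items() if pred(x) != target(x)), Fraction(0))


domain = range(5)
dist = {x: Fraction(1, 5) for x in domain}   # uniform distribution, for illustration only
target = h(frozenset())                      # the correct labeling is h_emptyset
sample = [(x, target(x)) for x in (0, 0, 1)]

print(true_error(erm(sample, domain, good=True), target, dist))    # 0
print(true_error(erm(sample, domain, good=False), target, dist))   # 3/5
```

Both outputs have zero empirical error on the sample, so both are legitimate ERMs; the bad one errs on every point it has not seen.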


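Claim 29.9. 


1. Let $\epsilon, \delta > 0$, $\mathcal{D}$ a distribution over $\mathcal{X}$ and $h_A \in \mathcal{H}$. Let $S$ be an i.i.d. sample 
consisting of $m \ge \frac{1}{\epsilon}\log\left(\frac{1}{\delta}\right)$ examples, sampled according to $\mathcal{D}$ and labeled by 
$h_A$. Then, with probability of at least $1 - \delta$, the hypothesis returned by $\mathcal{A}_{good}$ 
will have an error of at most $\epsilon$. 

2. There exists a constant $a > 0$ such that for every $0 < \epsilon < a$ there exists a 
distribution $\mathcal{D}$ over $\mathcal{X}$ and $h_A \in \mathcal{H}$ such that the following holds. The hypothesis 
returned by $\mathcal{A}_{bad}$ upon receiving a sample of size $m \le \frac{|\mathcal{X}|-1}{6\epsilon}$, sampled according 
to $\mathcal{D}$ and labeled by $h_A$, will have error $\ge \epsilon$ with probability $\ge e^{-1/6}$. 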


Proof. Let $\mathcal{D}$ be a distribution over $\mathcal{X}$ and suppose that the correct labeling is $h_A$. 
For any sample, $\mathcal{A}_{good}$ returns either $h_\emptyset$ or $h_A$. If it returns $h_A$ then its true error is 
zero. Thus, it returns a hypothesis with error $\ge \epsilon$ only if all the $m$ examples in the 
sample are from $\mathcal{X} \setminus A$ while the error of $h_\emptyset$, $L_\mathcal{D}(h_\emptyset) = \mathbb{P}_\mathcal{D}[A]$, is $\ge \epsilon$. Assume 
$m \ge \frac{1}{\epsilon}\log\left(\frac{1}{\delta}\right)$; then the probability of the latter event is no more than $(1-\epsilon)^m \le e^{-\epsilon m} \le \delta$. 
This establishes item 1. 

Next we prove item 2. We restrict the proof to the case that $|\mathcal{X}| = d < \infty$. The 
proof for infinite $\mathcal{X}$ is similar. Suppose that $\mathcal{X} = \{x_0,\ldots,x_{d-1}\}$. 

Let $a > 0$ be small enough such that $1 - 2\epsilon \ge e^{-4\epsilon}$ for every $\epsilon < a$ and fix some 
$\epsilon < a$. Define a distribution on $\mathcal{X}$ by setting $\mathbb{P}[x_0] = 1 - 2\epsilon$ and for all $1 \le i \le d-1$, 
$\mathbb{P}[x_i] = \frac{2\epsilon}{d-1}$. Suppose that the correct hypothesis is $h_\emptyset$ and let the sample size be $m$. 
Clearly, the hypothesis returned by $\mathcal{A}_{bad}$ will err on all the examples from $\mathcal{X}$ which 
are not in the sample. By Chernoff's bound, if $m \le \frac{d-1}{6\epsilon}$, then with probability $\ge e^{-1/6}$, 
the sample will include no more than $\frac{d-1}{2}$ examples from $\mathcal{X}$. Thus the returned 
hypothesis will have error $\ge \epsilon$. □ 


The conclusion of the example presented is that in multiclass classification, the 
sample complexity of different ERMs may differ. Are there “good” ERMs for every 
hypothesis class? The following conjecture asserts that the answer is yes. 


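Conjecture 29.10. The realizable sample complexity of every hypothesis class $\mathcal{H} \subseteq [k]^\mathcal{X}$ is 

$$m_\mathcal{H}(\epsilon, \delta) = \tilde{O}\left(\frac{\mathrm{Ndim}(\mathcal{H})}{\epsilon}\right).$$

We emphasize that the $\tilde{O}$ notation may hide only poly-log factors of $\epsilon$, $\delta$, and 
$\mathrm{Ndim}(\mathcal{H})$, but no factor of $k$. 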


29.5 
1. Let $\mathcal{H}_{discrete}$ be the class of all functions $f : [k-1] \times [d-1] \to \{0,1\}$ for which 
there exists some $i_0$ such that, for every $j \in [d-1]$, 

$$\forall i < i_0,\ f(i,j) = 1 \quad\text{while}\quad \forall i > i_0,\ f(i,j) = 0.$$

Show that $\mathrm{Ndim}(\mathcal{H}^{OvA,k}_{discrete}) = (d-1)\cdot(k-1)$. 

2. Show that $\mathcal{H}_{discrete}$ can be realized by $\mathcal{H}$. That is, show that there exists a 
mapping $\psi : [k-1] \times [d-1] \to \mathbb{R}^d$ such that 

$$\mathcal{H}_{discrete} \subseteq \{h \circ \psi : h \in \mathcal{H}\}.$$

Hint: You can take $\psi(i,j)$ to be the vector whose $j$th coordinate is 1, whose 
last coordinate is $i$, and the rest are zeros. 

3. Conclude that $\mathrm{Ndim}(\mathcal{H}^{OvA,k}) \ge (d-1)\cdot(k-1)$. 
Compression Bounds 


Throughout the book, we have tried to characterize the notion of learnability using 
different approaches. At first we have shown that the uniform convergence prop- 
erty of a hypothesis class guarantees successful learning. Later on we introduced 
the notion of stability and have shown that stable algorithms are guaranteed to be 
good learners. Yet there are other properties which may be sufficient for learning, 
and in this chapter and its sequel we will introduce two approaches to this issue: 
compression bounds and the PAC-Bayes approach. 

In this chapter we study compression bounds. Roughly speaking, we shall see 
that if a learning algorithm can express the output hypothesis using a small subset 
of the training set, then the error of the hypothesis on the rest of the examples 
estimates its true error. In other words, an algorithm that can “compress” its output 
is a good learner. 


30.1 COMPRESSION BOUNDS 


To motivate the results, let us first consider the following learning protocol. First, 
we sample a sequence of $k$ examples denoted $T$. On the basis of these examples, we 
construct a hypothesis denoted $h_T$. Now we would like to estimate the performance 
of $h_T$ so we sample a fresh sequence of $m-k$ examples, denoted $V$, and calculate the 
error of $h_T$ on $V$. Since $V$ and $T$ are independent, we immediately get the following 
from Bernstein’s inequality (see Lemma B.10). 


Lemma 30.1. Assume that the range of the loss function is $[0,1]$. Then, 

$$\mathbb{P}\left[ L_\mathcal{D}(h_T) - L_V(h_T) \ge \sqrt{\frac{2 L_V(h_T)\log(1/\delta)}{|V|}} + \frac{4\log(1/\delta)}{|V|} \right] \le \delta.$$


To derive this bound, all we needed was independence between $T$ and $V$. 
Therefore, we can redefine the protocol as follows. First, we agree on a sequence 
of $k$ indices $I = (i_1,\ldots,i_k) \in [m]^k$. Then, we sample a sequence of $m$ examples 
$S = (z_1,\ldots,z_m)$. Now, define $T = S_I = (z_{i_1},\ldots,z_{i_k})$ and define $V$ to be the rest of 
the examples in $S$. Note that this protocol is equivalent to the protocol we defined 
before; hence Lemma 30.1 still holds. 

Applying a union bound over the choice of the sequence of indices we obtain 
the following theorem. 


Theorem 30.2. Let $k$ be an integer and let $B : Z^k \to \mathcal{H}$ be a mapping from sequences 
of $k$ examples to the hypothesis class. Let $m \ge 2k$ be a training set size and let 
$A : Z^m \to \mathcal{H}$ be a learning rule that receives a training sequence $S$ of size $m$ and returns 
a hypothesis such that $A(S) = B(z_{i_1},\ldots,z_{i_k})$ for some $(i_1,\ldots,i_k) \in [m]^k$. Let 
$V = \{z_j : j \notin (i_1,\ldots,i_k)\}$ be the set of examples which were not selected for defining $A(S)$. Then, 
with probability of at least $1 - \delta$ over the choice of $S$ we have 

$$L_\mathcal{D}(A(S)) \le L_V(A(S)) + \sqrt{L_V(A(S))\,\frac{4k\log(m/\delta)}{m}} + \frac{8k\log(m/\delta)}{m}.$$


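Proof. For any $I \in [m]^k$ let $h_I = B(z_{i_1},\ldots,z_{i_k})$. Let $n = m - k$. Combining 
Lemma 30.1 with the union bound we have 

$$\mathbb{P}\left[\exists I \in [m]^k \text{ s.t. } L_\mathcal{D}(h_I) - L_V(h_I) \ge \sqrt{\frac{2L_V(h_I)\log(1/\delta)}{n}} + \frac{4\log(1/\delta)}{n}\right]$$
$$\le \sum_{I \in [m]^k} \mathbb{P}\left[L_\mathcal{D}(h_I) - L_V(h_I) \ge \sqrt{\frac{2L_V(h_I)\log(1/\delta)}{n}} + \frac{4\log(1/\delta)}{n}\right]$$
$$\le m^k \delta.$$

Denote $\delta' = m^k\delta$, so that $\log(1/\delta) = \log(m^k/\delta') \le k\log(m/\delta')$. Using the assumption 
$k \le m/2$, which implies that $n = m - k \ge m/2$, 
the above implies that with probability of at least $1 - \delta'$ we have that 

$$L_\mathcal{D}(A(S)) \le L_V(A(S)) + \sqrt{L_V(A(S))\,\frac{4k\log(m/\delta')}{m}} + \frac{8k\log(m/\delta')}{m},$$

which concludes our proof. □ 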
As a direct corollary we obtain: 


Corollary 30.3. Assuming the conditions of Theorem 30.2, and further assuming that 
$L_V(A(S)) = 0$, then, with probability of at least $1 - \delta$ over the choice of $S$ we have 

$$L_\mathcal{D}(A(S)) \le \frac{8k\log(m/\delta)}{m}.$$
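
To get a feel for the numbers, the following Python sketch evaluates the right-hand side of the bound, assuming it reads L_V + sqrt(L_V * 4k log(m/delta)/m) + 8k log(m/delta)/m as in Theorem 30.2. The function name `compression_bound` and the example values are illustrative assumptions.

```python
import math


def compression_bound(l_v: float, k: int, m: int, delta: float) -> float:
    """Right-hand side of Theorem 30.2 for a validation loss l_v in [0, 1]."""
    if m < 2 * k:
        raise ValueError("Theorem 30.2 assumes m >= 2k")
    c = k * math.log(m / delta) / m
    return l_v + math.sqrt(4.0 * l_v * c) + 8.0 * c


# With zero validation loss the bound reduces to Corollary 30.3: 8 k log(m/delta) / m.
print(compression_bound(0.0, k=4, m=10_000, delta=0.05))
print(compression_bound(0.05, k=4, m=10_000, delta=0.05))
```

The bound is only informative when k log(m/delta) is small relative to m, which matches the intuition that the compressed representation must be much smaller than the sample.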


These results motivate the following definition: 


Definition 30.4. (Compression Scheme) Let $\mathcal{H}$ be a hypothesis class of functions 
from $\mathcal{X}$ to $\mathcal{Y}$ and let $k$ be an integer. We say that $\mathcal{H}$ has a compression scheme of 
size $k$ if the following holds: 


For all $m$ there exists $A : Z^m \to [m]^k$ and $B : Z^k \to \mathcal{H}$ such that for all $h \in \mathcal{H}$, if we 
feed any training set of the form $(x_1, h(x_1)),\ldots,(x_m, h(x_m))$ into $A$ and then feed 
$(x_{i_1}, h(x_{i_1})),\ldots,(x_{i_k}, h(x_{i_k}))$ into $B$, where $(i_1,\ldots,i_k)$ is the output of $A$, then the 
output of $B$, denoted $h'$, satisfies $L_S(h') = 0$. 


It is possible to generalize the definition for unrealizable sequences as follows. 


Definition 30.5. (Compression Scheme for Unrealizable Sequences) Let $\mathcal{H}$ be a 
hypothesis class of functions from $\mathcal{X}$ to $\mathcal{Y}$ and let $k$ be an integer. We say that $\mathcal{H}$ 
has a compression scheme of size $k$ if the following holds: 

For all $m$ there exists $A : Z^m \to [m]^k$ and $B : Z^k \to \mathcal{H}$ such that for all $h \in \mathcal{H}$, 
if we feed any training set of the form $(x_1, y_1),\ldots,(x_m, y_m)$ into $A$ and then feed 
$(x_{i_1}, y_{i_1}),\ldots,(x_{i_k}, y_{i_k})$ into $B$, where $(i_1,\ldots,i_k)$ is the output of $A$, then the output of 
$B$, denoted $h'$, satisfies $L_S(h') \le L_S(h)$. 


The following lemma shows that the existence of a compression scheme for 
the realizable case also implies the existence of a compression scheme for the 
unrealizable case. 


Lemma 30.6. Let $\mathcal{H}$ be a hypothesis class for binary classification, and assume it 
has a compression scheme of size $k$ in the realizable case. Then, it has a compression 
scheme of size $k$ for the unrealizable case as well. 


Proof. Consider the following scheme: First, find an ERM hypothesis and denote 
it by $h$. Then, discard all the examples on which $h$ errs. Now, apply the realizable 
compression scheme on the examples that have not been removed. The output of 
the realizable compression scheme, denoted $h'$, must be correct on the examples that 
have not been removed. Since $h$ errs on the removed examples it follows that the 
error of $h'$ cannot be larger than the error of $h$; hence $h'$ is also an ERM hypothesis. □ 


30.2 EXAMPLES 


In the examples that follow, we present compression schemes for several hypothesis 
classes for binary classification. In light of Lemma 30.6 we focus on the realizable 
case. Therefore, to show that a certain hypothesis class has a compression scheme, 
it suffices to exhibit $A$, $B$, and $k$ for which $L_S(h') = 0$. 


30.2.1 Axis Aligned Rectangles 


Note that this is an uncountably infinite class. We show that there is a simple 
compression scheme. Consider the algorithm $A$ that works as follows: For each 
dimension, choose the two positive examples with extremal values at this dimension. 
Define $B$ to be the function that returns the minimal enclosing rectangle. Then, for 
$k = 2d$, we have that in the realizable case, $L_S(B(A(S))) = 0$. 
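
The scheme for axis aligned rectangles can be sketched in a few lines of Python. The function names `compress` and `reconstruct`, the use of labels in {0, 1}, and the convention of returning an empty index list when the sample has no positive examples are assumptions made here for illustration; the text itself only describes A and B informally.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def compress(sample: Sequence[Tuple[Point, int]]) -> List[int]:
    """A: indices of positive examples with extremal value in each dimension (at most 2d)."""
    positives = [i for i, (_, y) in enumerate(sample) if y == 1]
    if not positives:
        return []
    d = len(sample[0][0])
    chosen: List[int] = []
    for j in range(d):
        lo = min(positives, key=lambda i: sample[i][0][j])
        hi = max(positives, key=lambda i: sample[i][0][j])
        chosen += [lo, hi]
    return chosen


def reconstruct(kept: Sequence[Tuple[Point, int]]):
    """B: the minimal axis-aligned rectangle enclosing the kept points."""
    if not kept:
        return lambda x: 0                   # no positives seen: predict 0 everywhere
    pts = [x for x, _ in kept]
    lows = [min(c) for c in zip(*pts)]
    highs = [max(c) for c in zip(*pts)]
    return lambda x: int(all(l <= v <= u for l, v, u in zip(lows, x, highs)))


# Points labeled by the rectangle [0, 2] x [0, 1].
S = [((0.5, 0.5), 1), ((1.5, 0.2), 1), ((1.0, 0.9), 1), ((3.0, 0.5), 0), ((1.0, -1.0), 0)]
idx = compress(S)
h_prime = reconstruct([S[i] for i in idx])
print(all(h_prime(x) == y for x, y in S))    # True
```

The reconstructed rectangle contains every positive example by construction and, being contained in the true rectangle, excludes every negative one, so the empirical error is zero in the realizable case.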


30.2.2 Halfspaces 
Let $\mathcal{X} = \mathbb{R}^d$ and consider the class of homogeneous halfspaces, 
$\{x \mapsto \mathrm{sign}(\langle w, x \rangle) : w \in \mathbb{R}^d\}$. 


A Compression Scheme: 
W.l.o.g. assume all labels are positive (otherwise, replace $x_i$ by $y_i x_i$). The compression 
scheme we propose is as follows. First, $A$ finds the vector $w$ which is in the 
convex hull of $\{x_1,\ldots,x_m\}$ and has minimal norm. Then, it represents it as a convex 
combination of $d$ points in the sample (it will be shown later that this is always possible). 
The output of $A$ are these $d$ points. The algorithm $B$ receives these $d$ points 
and sets $w$ to be the point in their convex hull of minimal norm. 

Next we prove that this indeed is a compression scheme. Since the data is linearly 
separable, the convex hull of $\{x_1,\ldots,x_m\}$ does not contain the origin. Consider the 
point $w$ in this convex hull closest to the origin. (This is a unique point which is the 
Euclidean projection of the origin onto this convex hull.) We claim that $w$ separates 
the data.¹ To see this, assume by contradiction that $\langle w, x_i \rangle \le 0$ for some $i$. Take 

$$w' = (1-\alpha)w + \alpha x_i \quad\text{for}\quad \alpha = \frac{\|w\|^2}{\|x_i\|^2 + \|w\|^2} \in (0,1).$$

Then $w'$ is also in the convex hull and 

$$\begin{aligned}
\|w'\|^2 &= (1-\alpha)^2\|w\|^2 + \alpha^2\|x_i\|^2 + 2\alpha(1-\alpha)\langle w, x_i \rangle \\
&\le (1-\alpha)^2\|w\|^2 + \alpha^2\|x_i\|^2 \\
&= \frac{\|x_i\|^4\|w\|^2 + \|x_i\|^2\|w\|^4}{(\|w\|^2 + \|x_i\|^2)^2} \\
&= \frac{\|x_i\|^2\|w\|^2}{\|w\|^2 + \|x_i\|^2} \\
&= \|w\|^2 \cdot \frac{1}{\|w\|^2/\|x_i\|^2 + 1} \\
&< \|w\|^2,
\end{aligned}$$

which leads to a contradiction. 

We have thus shown that w is also an ERM. Finally, since w is in the convex hull 
of the examples, we can apply Caratheodory’s theorem to obtain that w is also in the 
convex hull of a subset of d +1 points of the polygon. Furthermore, the minimality 
of w implies that w must be on a face of the polygon and this implies it can be 
represented as a convex combination of d points. 

It remains to show that w is also the projection onto the polygon defined by the 
d points. But this must be true: On one hand, the smaller polygon is a subset of the 
larger one; hence the projection onto the smaller cannot be smaller in norm. On the 
other hand, w itself is a valid solution. The uniqueness of projection concludes our 
proof. 


30.2.3 Separating Polynomials 

Let $\mathcal{X} = \mathbb{R}^d$ and consider the class $x \mapsto \mathrm{sign}(p(x))$ where $p$ is a degree $r$ polynomial. 
Note that $p(x)$ can be rewritten as $\langle w, \psi(x) \rangle$ where the elements of $\psi(x)$ are 
all the monomials of $x$ up to degree $r$. Therefore, the problem of constructing a 
compression scheme for $p(x)$ reduces to the problem of constructing a compression 
scheme for halfspaces in $\mathbb{R}^{d'}$ where $d' = O(d^r)$. 
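
The monomial feature map can be written down explicitly. The Python sketch below enumerates all monomials of total degree at most r; the function name `monomials` and the ordering of the features are assumptions made here for illustration. The number of such monomials in d variables is the binomial coefficient C(d + r, r), which is O(d^r) for fixed r.

```python
from itertools import combinations_with_replacement
from math import comb, prod
from typing import List, Sequence


def monomials(x: Sequence[float], r: int) -> List[float]:
    """All monomials of the coordinates of x with total degree 0, 1, ..., r."""
    feats = []
    for degree in range(r + 1):
        for idx in combinations_with_replacement(range(len(x)), degree):
            feats.append(prod(x[i] for i in idx))   # the empty product is 1
    return feats


x = [2.0, 3.0]
print(monomials(x, 2))                        # [1, 2.0, 3.0, 4.0, 6.0, 9.0]
print(len(monomials(x, 2)), comb(2 + 2, 2))   # 6 6
```

Any degree-r polynomial p is then an inner product between a coefficient vector w and this feature vector, so the halfspace scheme applies in the feature space.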


¹ It can be shown that $w$ is the direction of the max-margin solution. 
PAC-Bayes 


The Minimum Description Length (MDL) and Occam’s razor principles allow a 
potentially very large hypothesis class but define a hierarchy over hypotheses and 
prefer to choose hypotheses that appear higher in the hierarchy. In this chapter we 
describe the PAC-Bayesian approach that further generalizes this idea. In the 
PAC-Bayesian approach, one expresses the prior knowledge by defining a prior distribution 
over the hypothesis class. 


31.1 PAC-BAYES BOUNDS 


As in the MDL paradigm, we define a hierarchy over hypotheses in our class $\mathcal{H}$. 
Now, the hierarchy takes the form of a prior distribution over $\mathcal{H}$. That is, we assign 
a probability (or density if $\mathcal{H}$ is continuous) $P(h) \ge 0$ for each $h \in \mathcal{H}$ and refer to 
$P(h)$ as the prior score of $h$. Following the Bayesian reasoning approach, the output 
of the learning algorithm is not necessarily a single hypothesis. Instead, the learning 
process defines a posterior probability over $\mathcal{H}$, which we denote by $Q$. In the context 
of a supervised learning problem, where $\mathcal{H}$ contains functions from $\mathcal{X}$ to $\mathcal{Y}$, one can 
think of $Q$ as defining a randomized prediction rule as follows. Whenever we get a 
new instance $x$, we randomly pick a hypothesis $h \in \mathcal{H}$ according to $Q$ and predict 
$h(x)$. We define the loss of $Q$ on an example $z$ to be 

$$\ell(Q, z) \overset{\text{def}}{=} \mathbb{E}_{h \sim Q}[\ell(h, z)].$$

By the linearity of expectation, the generalization loss and training loss of $Q$ can be 
written as 

$$L_\mathcal{D}(Q) = \mathbb{E}_{h\sim Q}[L_\mathcal{D}(h)] \quad\text{and}\quad L_S(Q) = \mathbb{E}_{h\sim Q}[L_S(h)].$$


The following theorem tells us that the difference between the generalization 
loss and the empirical loss of a posterior Q is bounded by an expression that depends 
on the Kullback-Leibler divergence between Q and the prior distribution P. The 
Kullback-Leibler is a natural measure of the distance between two distributions. 
The theorem suggests that if we would like to minimize the generalization loss of Q, 
we should jointly minimize both the empirical loss of $Q$ and the Kullback-Leibler 
distance between Q and the prior distribution. We will later show how in some cases 
this idea leads to the regularized risk minimization principle. 


Theorem 31.1. Let $\mathcal{D}$ be an arbitrary distribution over an example domain $Z$. Let $\mathcal{H}$ 
be a hypothesis class and let $\ell : \mathcal{H} \times Z \to [0,1]$ be a loss function. Let $P$ be a prior 
distribution over $\mathcal{H}$ and let $\delta \in (0,1)$. Then, with probability of at least $1 - \delta$ over 
the choice of an i.i.d. training set $S = \{z_1,\ldots,z_m\}$ sampled according to $\mathcal{D}$, for all 
distributions $Q$ over $\mathcal{H}$ (even such that depend on $S$), we have 

$$L_\mathcal{D}(Q) \le L_S(Q) + \sqrt{\frac{D(Q\|P) + \ln(m/\delta)}{2(m-1)}},$$

where 

$$D(Q\|P) \overset{\text{def}}{=} \mathbb{E}_{h\sim Q}\left[\ln\left(Q(h)/P(h)\right)\right]$$

is the Kullback-Leibler divergence. 
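
For a finite hypothesis class both quantities in the bound are easy to compute. The Python sketch below evaluates the Kullback-Leibler divergence between two discrete distributions and the resulting right-hand side L_S(Q) + sqrt((D(Q||P) + ln(m/delta)) / (2(m-1))). The function names, the example prior and posterior, and the training loss value 0.1 are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def kl_divergence(q: Sequence[float], p: Sequence[float]) -> float:
    """D(Q||P) = sum_h Q(h) ln(Q(h)/P(h)) for distributions on a finite class."""
    total = 0.0
    for qh, ph in zip(q, p):
        if qh > 0.0:                         # terms with Q(h) = 0 contribute nothing
            if ph == 0.0:
                return math.inf              # Q puts mass where the prior has none
            total += qh * math.log(qh / ph)
    return total


def pac_bayes_bound(train_loss: float, kl: float, m: int, delta: float) -> float:
    """Right-hand side of Theorem 31.1."""
    return train_loss + math.sqrt((kl + math.log(m / delta)) / (2.0 * (m - 1)))


prior = [0.25, 0.25, 0.25, 0.25]             # uniform prior over four hypotheses
posterior = [0.0, 1.0, 0.0, 0.0]             # all mass on one hypothesis
kl = kl_divergence(posterior, prior)         # equals ln 4
print(kl, pac_bayes_bound(0.1, kl, m=1000, delta=0.05))
```

With a uniform prior and a point-mass posterior the divergence equals the logarithm of the class size, which is the situation considered in Exercise 31.2.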


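Proof. For any function $f(S)$, using Markov’s inequality: 

$$\mathbb{P}_S[f(S) \ge \epsilon] = \mathbb{P}_S[e^{f(S)} \ge e^{\epsilon}] \le \frac{\mathbb{E}_S[e^{f(S)}]}{e^{\epsilon}}. \qquad (31.1)$$

Let $\Delta(h) = L_\mathcal{D}(h) - L_S(h)$. We will apply Equation (31.1) with the function 

$$f(S) = \sup_Q \left( 2(m-1)\,\mathbb{E}_{h\sim Q}(\Delta(h))^2 - D(Q\|P) \right).$$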

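We now turn to bound $\mathbb{E}_S[e^{f(S)}]$. The main trick is to upper bound $f(S)$ by using an 
expression that does not depend on $Q$ but rather depends on the prior probability 
$P$. To do so, fix some $S$ and note that from the definition of $D(Q\|P)$ we get that for 
all $Q$, 

$$\begin{aligned}
2(m-1)\,\mathbb{E}_{h\sim Q}(\Delta(h))^2 - D(Q\|P) &= \mathbb{E}_{h\sim Q}\left[\ln\left(e^{2(m-1)\Delta(h)^2} P(h)/Q(h)\right)\right] \\
&\le \ln \mathbb{E}_{h\sim Q}\left[e^{2(m-1)\Delta(h)^2} P(h)/Q(h)\right] \\
&= \ln \mathbb{E}_{h\sim P}\left[e^{2(m-1)\Delta(h)^2}\right], \qquad (31.2)
\end{aligned}$$

where the inequality follows from Jensen’s inequality and the concavity of the log 
function. Therefore, 

$$\mathbb{E}_S[e^{f(S)}] \le \mathbb{E}_S\,\mathbb{E}_{h\sim P}\left[e^{2(m-1)\Delta(h)^2}\right]. \qquad (31.3)$$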


The advantage of the expression on the right-hand side stems from the fact that 
we can switch the order of expectations (because $P$ is a prior that does not depend 
on $S$), which yields 

$$\mathbb{E}_S[e^{f(S)}] \le \mathbb{E}_{h\sim P}\,\mathbb{E}_S\left[e^{2(m-1)\Delta(h)^2}\right]. \qquad (31.4)$$

Next, we claim that for all $h$ we have $\mathbb{E}_S[e^{2(m-1)\Delta(h)^2}] \le m$. To do so, recall that 
Hoeffding’s inequality tells us that 

$$\mathbb{P}_S[\Delta(h) \ge \epsilon] \le e^{-2m\epsilon^2}.$$

This implies that $\mathbb{E}_S[e^{2(m-1)\Delta(h)^2}] \le m$ (see Exercise 31.1). Combining this with 
Equation (31.4) and plugging into Equation (31.1) we get 

$$\mathbb{P}_S[f(S) \ge \epsilon] \le \frac{m}{e^{\epsilon}}. \qquad (31.5)$$


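Denote the right-hand side of the above $\delta$, thus $\epsilon = \ln(m/\delta)$, and we therefore obtain 
that with probability of at least $1 - \delta$ we have that for all $Q$ 

$$2(m-1)\,\mathbb{E}_{h\sim Q}(\Delta(h))^2 - D(Q\|P) \le \epsilon = \ln(m/\delta).$$

Rearranging the inequality and using Jensen’s inequality again (the function $x^2$ is 
convex) we conclude that 

$$\left(\mathbb{E}_{h\sim Q}\,\Delta(h)\right)^2 \le \mathbb{E}_{h\sim Q}(\Delta(h))^2 \le \frac{\ln(m/\delta) + D(Q\|P)}{2(m-1)}. \qquad (31.6)$$

□ 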


Remark 31.1 (Regularization). The PAC-Bayes bound leads to the following 
learning rule: 


Given a prior $P$, return a posterior $Q$ that minimizes the function 

$$L_S(Q) + \sqrt{\frac{D(Q\|P) + \ln(m/\delta)}{2(m-1)}}. \qquad (31.7)$$

This rule is similar to the regularized risk minimization principle. That is, we jointly 
minimize the empirical loss of $Q$ on the sample and the Kullback-Leibler "distance" 
between $Q$ and $P$. 


31.2 BIBLIOGRAPHIC REMARKS 


PAC-Bayes bounds were first introduced by McAllester (1998). See also 
(McAllester 1999, McAllester 2003, Seeger 2003, Langford & Shawe-Taylor 2003, 
Langford 2006). 


31.3 EXERCISES 


31.1 Let $X$ be a random variable that satisfies $\mathbb{P}[X \ge \epsilon] \le e^{-2m\epsilon^2}$. Prove that 
$\mathbb{E}[e^{2(m-1)X^2}] \le m$. 
31.2 
- Suppose that $\mathcal{H}$ is a finite hypothesis class, set the prior to be uniform over $\mathcal{H}$, 
and set the posterior to be $Q(h_S) = 1$ for some $h_S$ and $Q(h) = 0$ for all other 
$h \in \mathcal{H}$. Show that 

$$L_\mathcal{D}(h_S) \le L_S(h_S) + \sqrt{\frac{\ln(|\mathcal{H}|) + \ln(m/\delta)}{2(m-1)}}.$$

Compare to the bounds we derived using uniform convergence. 
- Derive a bound similar to the Occam bound given in Chapter 7 using the 
PAC-Bayes bound. 
Technical Lemmas 


Lemma A.1. Let $a > 0$. Then: $x \ge 2a\log(a) \Rightarrow x \ge a\log(x)$. It follows that a 
necessary condition for the inequality $x < a\log(x)$ to hold is that $x < 2a\log(a)$. 


Proof. First note that for $a \in (0, \sqrt{e}]$ the inequality $x \ge a\log(x)$ holds unconditionally 
and therefore the claim is trivial. From now on, assume that $a > \sqrt{e}$. Consider 
the function $f(x) = x - a\log(x)$. The derivative is $f'(x) = 1 - a/x$. Thus, for $x > a$ 
the derivative is positive and the function increases. In addition, 

$$\begin{aligned}
f(2a\log(a)) &= 2a\log(a) - a\log(2a\log(a)) \\
&= 2a\log(a) - a\log(a) - a\log(2\log(a)) \\
&= a\log(a) - a\log(2\log(a)).
\end{aligned}$$

Since $a - 2\log(a) > 0$ for all $a > 0$, the proof follows. □ 
Lemma A.2. Let $a \ge 1$ and $b > 0$. Then: $x \ge 4a\log(2a) + 2b \Rightarrow x \ge a\log(x) + b$. 


Proof. It suffices to prove that $x \ge 4a\log(2a) + 2b$ implies that both $x \ge 2a\log(x)$ 
and $x \ge 2b$. Since we assume $a \ge 1$ we clearly have that $x \ge 2b$. In addition, since 
$b > 0$ we have that $x \ge 4a\log(2a)$ which using Lemma A.1 implies that $x \ge 2a\log(x)$. 
This concludes our proof. □ 


Lemma A.3. Let $X$ be a random variable and $x' \in \mathbb{R}$ be a scalar and assume that 
there exists $a > 0$ such that for all $t \ge 0$ we have $\mathbb{P}[|X - x'| > t] \le 2e^{-t^2/a^2}$. Then, 
$\mathbb{E}[|X - x'|] \le 4a$. 


Proof. For all $i = 0, 1, 2, \ldots$ denote $t_i = a\,i$. Since $t_i$ is monotonically increasing we 
have that $\mathbb{E}[|X - x'|]$ is at most $\sum_{i=1}^{\infty} t_i\,\mathbb{P}[|X - x'| > t_{i-1}]$. Combining this with the 
assumption in the lemma we get that $\mathbb{E}[|X - x'|] \le 2a\sum_{i=1}^{\infty} i\,e^{-(i-1)^2}$. The proof 
now follows from the inequalities 

$$\sum_{i=1}^{\infty} i\,e^{-(i-1)^2} \le \sum_{i=1}^{5} i\,e^{-(i-1)^2} + \int_5^{\infty} x\,e^{-(x-1)^2}\,dx < 1.8 + 10^{-7} < 2.$$

□ 
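
The numerical constants in the last display are easy to sanity-check. The Python sketch below sums the first five terms and a longer partial sum of the series; truncating at 50 terms is an assumption made here, justified by the extremely fast decay of the terms.

```python
import math

# Partial sums of i * exp(-(i - 1)^2); the terms decay so fast that 50 terms are ample.
first_five = sum(i * math.exp(-((i - 1) ** 2)) for i in range(1, 6))
partial = sum(i * math.exp(-((i - 1) ** 2)) for i in range(1, 51))
print(first_five, partial)
```

Both values come out just below 1.8, consistent with the bound of 2 used in the proof.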
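Lemma A.4. Let $X$ be a random variable and $x' \in \mathbb{R}$ be a scalar and assume that 
there exists $a > 0$ and $b \ge e$ such that for all $t \ge 0$ we have $\mathbb{P}[|X - x'| > t] \le 2b\,e^{-t^2/a^2}$. 
Then, $\mathbb{E}[|X - x'|] \le a(2 + \sqrt{\log(b)})$. 


Proof. For all $i = 0, 1, 2, \ldots$ denote $t_i = a(i + \sqrt{\log(b)})$. Since $t_i$ is monotonically 
increasing we have that 

$$\mathbb{E}[|X - x'|] \le a\sqrt{\log(b)} + \sum_{i=1}^{\infty} t_i\,\mathbb{P}[|X - x'| > t_{i-1}].$$

Using the assumption in the lemma we have 

$$\begin{aligned}
\sum_{i=1}^{\infty} t_i\,\mathbb{P}[|X - x'| > t_{i-1}] &\le 2ab\sum_{i=1}^{\infty} (i + \sqrt{\log(b)})\,e^{-(i-1+\sqrt{\log(b)})^2} \\
&\le 2ab\int_{1+\sqrt{\log(b)}}^{\infty} x\,e^{-(x-1)^2}\,dx \\
&= 2ab\int_{\sqrt{\log(b)}}^{\infty} (y+1)\,e^{-y^2}\,dy \\
&\le 4ab\int_{\sqrt{\log(b)}}^{\infty} y\,e^{-y^2}\,dy \\
&= 2ab\left[-e^{-y^2}\right]_{\sqrt{\log(b)}}^{\infty} \\
&= 2ab/b = 2a.
\end{aligned}$$

Combining the preceding inequalities we conclude our proof. □ 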
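Lemma A.5. Let $m, d$ be two positive integers such that $d \le m - 2$. Then, 

$$\sum_{k=0}^{d} \binom{m}{k} \le \left(\frac{em}{d}\right)^d.$$

Proof. We prove the claim by induction. For $d = 1$ the left-hand side equals $1 + m$ 
while the right-hand side equals $em$; hence the claim is true. Assume that the claim 
holds for $d$ and let us prove it for $d + 1$. By the induction assumption we have 

$$\begin{aligned}
\sum_{k=0}^{d+1} \binom{m}{k} &\le \left(\frac{em}{d}\right)^d + \binom{m}{d+1} \\
&= \left(\frac{em}{d}\right)^d \left(1 + \left(\frac{d}{em}\right)^d \frac{m(m-1)(m-2)\cdots(m-d)}{(d+1)\,d!}\right) \\
&\le \left(\frac{em}{d}\right)^d \left(1 + \left(\frac{d}{e}\right)^d \frac{(m-d)}{(d+1)\,d!}\right).
\end{aligned}$$

Using Stirling’s approximation we further have that 

$$\begin{aligned}
&\le \left(\frac{em}{d}\right)^d \left(1 + \left(\frac{d}{e}\right)^d \frac{(m-d)}{(d+1)\sqrt{2\pi d}\,(d/e)^d}\right) \\
&= \left(\frac{em}{d}\right)^d \left(1 + \frac{m-d}{\sqrt{2\pi d}\,(d+1)}\right) \\
&= \left(\frac{em}{d}\right)^d \cdot \frac{d + 1 + (m-d)/\sqrt{2\pi d}}{d+1} \\
&\le \left(\frac{em}{d}\right)^d \cdot \frac{d + 1 + (m-d)/2}{d+1} \\
&= \left(\frac{em}{d}\right)^d \cdot \frac{d/2 + 1 + m/2}{d+1} \\
&\le \left(\frac{em}{d}\right)^d \cdot \frac{m}{d+1},
\end{aligned}$$

where in the last inequality we used the assumption that $d \le m - 2$. On the other 
hand, 

$$\begin{aligned}
\left(\frac{em}{d+1}\right)^{d+1} &= \left(\frac{em}{d}\right)^d \cdot \frac{em}{d+1} \cdot \left(\frac{d}{d+1}\right)^d \\
&= \left(\frac{em}{d}\right)^d \cdot \frac{em}{d+1} \cdot \frac{1}{(1+1/d)^d} \\
&\ge \left(\frac{em}{d}\right)^d \cdot \frac{em}{d+1} \cdot \frac{1}{e} \\
&= \left(\frac{em}{d}\right)^d \cdot \frac{m}{d+1},
\end{aligned}$$

which proves our inductive argument. □ 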
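
The inequality of Lemma A.5 can be checked directly for small values with exact integer arithmetic on the left-hand side. In the Python sketch below, the function names and the tested ranges of m and d are assumptions chosen for illustration; the check is a sanity test, not a substitute for the proof.

```python
import math


def binomial_prefix_sum(m: int, d: int) -> int:
    """sum_{k=0}^{d} C(m, k), computed exactly."""
    return sum(math.comb(m, k) for k in range(d + 1))


def lemma_a5_bound(m: int, d: int) -> float:
    """(e m / d)^d."""
    return (math.e * m / d) ** d


for m in (5, 20, 100):
    for d in range(1, m - 1):                # d <= m - 2, as the lemma assumes
        assert binomial_prefix_sum(m, d) <= lemma_a5_bound(m, d)
print(binomial_prefix_sum(20, 3), lemma_a5_bound(20, 3))
```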


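Lemma A.6. For all $a \in \mathbb{R}$ we have 

$$\frac{e^a + e^{-a}}{2} \le e^{a^2/2}.$$

Proof. Observe that 

$$e^a = \sum_{n=0}^{\infty} \frac{a^n}{n!}.$$

Therefore, 

$$\frac{e^a + e^{-a}}{2} = \sum_{n=0}^{\infty} \frac{a^{2n}}{(2n)!},$$

and 

$$e^{a^2/2} = \sum_{n=0}^{\infty} \frac{a^{2n}}{2^n\,n!}.$$

Observing that $(2n)! \ge 2^n\,n!$ for every $n \ge 0$ we conclude our proof. □ 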
Measure Concentration 


Let $Z_1,\ldots,Z_m$ be an i.i.d. sequence of random variables and let $\mu$ be their mean. 
The strong law of large numbers states that when $m$ tends to infinity, the empirical 
average, $\frac{1}{m}\sum_{i=1}^m Z_i$, converges to the expected value $\mu$, with probability 1. Measure 
concentration inequalities quantify the deviation of the empirical average from the 
expectation when $m$ is finite. 


B.1 MARKOV’S INEQUALITY 


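We start with an inequality which is called Markov’s inequality. Let $Z$ be a 
nonnegative random variable. The expectation of $Z$ can be written as follows: 

$$\mathbb{E}[Z] = \int_{x=0}^{\infty} \mathbb{P}[Z \ge x]\,dx. \qquad (B.1)$$

Since $\mathbb{P}[Z \ge x]$ is monotonically nonincreasing we obtain 

$$\forall a > 0,\quad \mathbb{E}[Z] \ge \int_{x=0}^{a} \mathbb{P}[Z \ge x]\,dx \ge \int_{x=0}^{a} \mathbb{P}[Z \ge a]\,dx = a\,\mathbb{P}[Z \ge a]. \qquad (B.2)$$

Rearranging the inequality yields Markov’s inequality: 

$$\forall a > 0,\quad \mathbb{P}[Z \ge a] \le \frac{\mathbb{E}[Z]}{a}. \qquad (B.3)$$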
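
Markov's inequality holds for any nonnegative distribution, including the empirical distribution of a finite sample, which makes it easy to check by hand. In the Python sketch below, the function names and the sample values are assumptions chosen for illustration.

```python
from typing import Sequence


def tail_fraction(values: Sequence[float], a: float) -> float:
    """Fraction of the values that are >= a, i.e. P[Z >= a] under the empirical distribution."""
    return sum(1 for v in values if v >= a) / len(values)


def markov_bound(values: Sequence[float], a: float) -> float:
    """E[Z] / a under the empirical distribution; requires nonnegative values and a > 0."""
    return (sum(values) / len(values)) / a


z = [0.0, 1.0, 1.0, 2.0, 6.0]                # a nonnegative sample with mean 2
for a in (1.0, 2.0, 4.0):
    print(a, tail_fraction(z, a), markov_bound(z, a))
```

In every row the tail fraction is at most the Markov bound, and the bound becomes vacuous (larger than 1) whenever a is below the mean.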


For random variables that take value in [0,1], we can derive from Markov’s 
inequality the following. 


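Lemma B.1. Let $Z$ be a random variable that takes values in $[0,1]$. Assume that 
$\mathbb{E}[Z] = \mu$. Then, for any $a \in (0,1)$, 

$$\mathbb{P}[Z > 1 - a] \ge \frac{\mu - (1-a)}{a}.$$

This also implies that for every $a \in (0,1)$, 

$$\mathbb{P}[Z > a] \ge \frac{\mu - a}{1 - a} \ge \mu - a.$$

Proof. Let $Y = 1 - Z$. Then $Y$ is a nonnegative random variable with 
$\mathbb{E}[Y] = 1 - \mathbb{E}[Z] = 1 - \mu$. Applying Markov’s inequality on $Y$ we obtain 

$$\mathbb{P}[Z \le 1 - a] = \mathbb{P}[1 - Z \ge a] \le \frac{\mathbb{E}[Y]}{a} = \frac{1 - \mu}{a}.$$

Therefore, 

$$\mathbb{P}[Z > 1 - a] \ge 1 - \frac{1 - \mu}{a} = \frac{a + \mu - 1}{a}.$$

□ 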


B.2 CHEBYSHEV’S INEQUALITY 


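Applying Markov’s inequality on the random variable $(Z - \mathbb{E}[Z])^2$ we obtain 
Chebyshev’s inequality: 

$$\forall a > 0,\quad \mathbb{P}[|Z - \mathbb{E}[Z]| \ge a] = \mathbb{P}[(Z - \mathbb{E}[Z])^2 \ge a^2] \le \frac{\mathrm{Var}[Z]}{a^2}, \qquad (B.4)$$

where $\mathrm{Var}[Z] = \mathbb{E}[(Z - \mathbb{E}[Z])^2]$ is the variance of $Z$. 

Consider the random variable $\frac{1}{m}\sum_{i=1}^m Z_i$. Since $Z_1,\ldots,Z_m$ are i.i.d. it is easy to 
verify that 

$$\mathrm{Var}\left[\frac{1}{m}\sum_{i=1}^m Z_i\right] = \frac{\mathrm{Var}[Z_1]}{m}.$$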
Applying Chebyshev’s inequality, we obtain the following: 
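Lemma B.2. Let $Z_1,\ldots,Z_m$ be a sequence of i.i.d. random variables and assume 
that $\mathbb{E}[Z_1] = \mu$ and $\mathrm{Var}[Z_1] \le 1$. Then, for any $\delta \in (0,1)$, with probability of at least 
$1 - \delta$ we have 

$$\left|\frac{1}{m}\sum_{i=1}^m Z_i - \mu\right| \le \sqrt{\frac{1}{\delta m}}.$$

Proof. Applying Chebyshev’s inequality we obtain that for all $a > 0$ 

$$\mathbb{P}\left[\left|\frac{1}{m}\sum_{i=1}^m Z_i - \mu\right| > a\right] \le \frac{\mathrm{Var}[Z_1]}{m a^2} \le \frac{1}{m a^2}.$$

The proof follows by denoting the right-hand side $\delta$ and solving for $a$. □ 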


The deviation between the empirical average and the mean given previously 
decreases polynomially with m. It is possible to obtain a significantly faster decrease. 
In the sections that follow we derive bounds that decrease exponentially fast. 


B.3 CHERNOFF’S BOUNDS 


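Let $Z_1,\ldots,Z_m$ be independent Bernoulli variables where for every $i$, $\mathbb{P}[Z_i = 1] = p_i$ 
and $\mathbb{P}[Z_i = 0] = 1 - p_i$. Let $p = \sum_{i=1}^m p_i$ and let $Z = \sum_{i=1}^m Z_i$. Using the monotonicity 
of the exponent function and Markov’s inequality, we have that for every $t > 0$ 

$$\mathbb{P}[Z > (1+\delta)p] = \mathbb{P}[e^{tZ} > e^{t(1+\delta)p}] \le \frac{\mathbb{E}[e^{tZ}]}{e^{(1+\delta)tp}}. \qquad (B.5)$$

Next, 

$$\begin{aligned}
\mathbb{E}[e^{tZ}] = \mathbb{E}\left[e^{t\sum_i Z_i}\right] &= \mathbb{E}\left[\prod_i e^{tZ_i}\right] \\
&= \prod_i \mathbb{E}[e^{tZ_i}] && \text{by independence} \\
&= \prod_i \left(p_i e^t + (1 - p_i)e^0\right) \\
&= \prod_i \left(1 + p_i(e^t - 1)\right) \\
&\le \prod_i e^{p_i(e^t - 1)} && \text{using } 1 + x \le e^x \\
&= e^{\sum_i p_i(e^t - 1)} \\
&= e^{(e^t - 1)p}.
\end{aligned}$$

Combining the above with Equation (B.5) and choosing $t = \log(1+\delta)$ we obtain 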

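Lemma B.3. Let $Z_1,\ldots,Z_m$ be independent Bernoulli variables where for every $i$, 
$\mathbb{P}[Z_i = 1] = p_i$ and $\mathbb{P}[Z_i = 0] = 1 - p_i$. Let $p = \sum_{i=1}^m p_i$ and let $Z = \sum_{i=1}^m Z_i$. Then, 
for any $\delta > 0$, 

$$\mathbb{P}[Z > (1+\delta)p] \le e^{-h(\delta)\,p},$$

where 

$$h(\delta) = (1+\delta)\log(1+\delta) - \delta.$$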


Using the inequality $h(a) \ge a^2/(2 + 2a/3)$ we obtain 


Lemma B.4. Using the notation of Lemma B.3 we also have 

$$\mathbb{P}[Z > (1+\delta)p] \le e^{-p\,\frac{\delta^2}{2 + 2\delta/3}}.$$
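
The two upper-tail bounds can be compared with the exact binomial tail in the special case where all the success probabilities are equal. In the Python sketch below, the function names and the parameter values (m = 100, success probability 0.3, delta = 0.5) are assumptions chosen for illustration.

```python
import math


def binomial_upper_tail(m: int, q: float, threshold: float) -> float:
    """Exact P[Z > threshold] for Z ~ Binomial(m, q)."""
    return sum(math.comb(m, j) * q**j * (1 - q) ** (m - j)
               for j in range(m + 1) if j > threshold)


def chernoff_b3(p: float, delta: float) -> float:
    """exp(-h(delta) p) with h(delta) = (1 + delta) log(1 + delta) - delta."""
    return math.exp(-((1 + delta) * math.log(1 + delta) - delta) * p)


def chernoff_b4(p: float, delta: float) -> float:
    """exp(-p delta^2 / (2 + 2 delta / 3))."""
    return math.exp(-p * delta**2 / (2 + 2 * delta / 3))


m, q, delta = 100, 0.3, 0.5                  # identical p_i = q, so p = m * q = 30
p = m * q
print(binomial_upper_tail(m, q, (1 + delta) * p), chernoff_b3(p, delta), chernoff_b4(p, delta))
```

The exact tail is the smallest of the three numbers, the bound of Lemma B.3 comes next, and the simplified bound of Lemma B.4 is slightly looser, as the derivation suggests.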


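For the other direction, we apply similar calculations: 

$$\mathbb{P}[Z < (1-\delta)p] = \mathbb{P}[-Z > -(1-\delta)p] = \mathbb{P}[e^{-tZ} > e^{-(1-\delta)tp}] \le \frac{\mathbb{E}[e^{-tZ}]}{e^{-(1-\delta)tp}},$$

and 

$$\begin{aligned}
\mathbb{E}[e^{-tZ}] = \mathbb{E}\left[e^{-t\sum_i Z_i}\right] &= \mathbb{E}\left[\prod_i e^{-tZ_i}\right] \\
&= \prod_i \mathbb{E}[e^{-tZ_i}] && \text{by independence} \\
&= \prod_i \left(1 + p_i(e^{-t} - 1)\right) \\
&\le \prod_i e^{p_i(e^{-t} - 1)} && \text{using } 1 + x \le e^x \\
&= e^{(e^{-t} - 1)p}.
\end{aligned}$$

Setting $t = -\log(1-\delta)$ yields 

$$\mathbb{P}[Z < (1-\delta)p] \le \frac{e^{-\delta p}}{e^{(1-\delta)\log(1-\delta)\,p}} = e^{-p\,h(-\delta)}.$$

It is easy to verify that $h(-\delta) \ge h(\delta)$ and hence 


Lemma B.5. Using the notation of Lemma B.3 we also have 

$$\mathbb{P}[Z < (1-\delta)p] \le e^{-p\,h(-\delta)} \le e^{-p\,h(\delta)} \le e^{-p\,\frac{\delta^2}{2 + 2\delta/3}}.$$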


B.4 HOEFFDING’S INEQUALITY 


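Lemma B.6 (Hoeffding’s Inequality). Let $Z_1,\ldots,Z_m$ be a sequence of i.i.d. random 
variables and let $\bar{Z} = \frac{1}{m}\sum_{i=1}^m Z_i$. Assume that $\mathbb{E}[\bar{Z}] = \mu$ and $\mathbb{P}[a \le Z_i \le b] = 1$ for 
every $i$. Then, for any $\epsilon > 0$ 

$$\mathbb{P}\left[\left|\frac{1}{m}\sum_{i=1}^m Z_i - \mu\right| > \epsilon\right] \le 2\exp\left(-2m\epsilon^2/(b-a)^2\right).$$

Proof. Denote $X_i = Z_i - \mathbb{E}[Z_i]$ and $\bar{X} = \frac{1}{m}\sum_i X_i$. Using the monotonicity of the 
exponent function and Markov’s inequality, we have that for every $\lambda > 0$ and $\epsilon > 0$, 

$$\mathbb{P}[\bar{X} \ge \epsilon] = \mathbb{P}[e^{\lambda\bar{X}} \ge e^{\lambda\epsilon}] \le e^{-\lambda\epsilon}\,\mathbb{E}[e^{\lambda\bar{X}}].$$

Using the independence assumption we also have 

$$\mathbb{E}[e^{\lambda\bar{X}}] = \mathbb{E}\left[\prod_i e^{\lambda X_i/m}\right] = \prod_i \mathbb{E}[e^{\lambda X_i/m}].$$

By Hoeffding’s lemma (Lemma B.7 later), for every $i$ we have 

$$\mathbb{E}[e^{\lambda X_i/m}] \le e^{\frac{\lambda^2(b-a)^2}{8m^2}}.$$

Therefore, 

$$\mathbb{P}[\bar{X} \ge \epsilon] \le e^{-\lambda\epsilon}\prod_i e^{\frac{\lambda^2(b-a)^2}{8m^2}} = e^{-\lambda\epsilon + \frac{\lambda^2(b-a)^2}{8m}}.$$

Setting $\lambda = 4m\epsilon/(b-a)^2$ we obtain 

$$\mathbb{P}[\bar{X} \ge \epsilon] \le e^{-\frac{2m\epsilon^2}{(b-a)^2}}.$$

Applying the same arguments on the variable $-\bar{X}$ we obtain that 
$\mathbb{P}[\bar{X} \le -\epsilon] \le e^{-\frac{2m\epsilon^2}{(b-a)^2}}$. The theorem follows by applying the union bound on the two cases. □ 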
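
Hoeffding's inequality is often used in the inverted form: given an accuracy and a confidence level, how many samples suffice? The Python sketch below solves 2 exp(-2 m eps^2 / (b-a)^2) <= delta for m. The function names and the default range [0, 1] are assumptions chosen for illustration.

```python
import math


def hoeffding_bound(m: int, eps: float, a: float = 0.0, b: float = 1.0) -> float:
    """2 exp(-2 m eps^2 / (b - a)^2), the right-hand side of Lemma B.6."""
    return 2.0 * math.exp(-2.0 * m * eps**2 / (b - a) ** 2)


def hoeffding_sample_size(eps: float, delta: float, a: float = 0.0, b: float = 1.0) -> int:
    """Smallest m with hoeffding_bound(m, eps) <= delta, from solving the bound for m."""
    return math.ceil((b - a) ** 2 * math.log(2.0 / delta) / (2.0 * eps**2))


m = hoeffding_sample_size(eps=0.05, delta=0.01)
print(m, hoeffding_bound(m, 0.05))
```

The required sample size grows like 1/eps^2 but only logarithmically in 1/delta, in contrast to the 1/delta dependence obtained from Chebyshev's inequality in Lemma B.2.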


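Lemma B.7 (Hoeffding’s Lemma). Let $X$ be a random variable that takes values in 
the interval $[a, b]$ and such that $\mathbb{E}[X] = 0$. Then, for every $\lambda > 0$, 

$$\mathbb{E}[e^{\lambda X}] \le e^{\frac{\lambda^2(b-a)^2}{8}}.$$

Proof. Since $f(x) = e^{\lambda x}$ is a convex function, we have that for every $\alpha \in (0,1)$, and 
$x \in [a, b]$, 

$$f(x) \le \alpha f(a) + (1-\alpha) f(b).$$

Setting $\alpha = \frac{b - x}{b - a} \in [0, 1]$ yields 

$$e^{\lambda x} \le \frac{b - x}{b - a}\,e^{\lambda a} + \frac{x - a}{b - a}\,e^{\lambda b}.$$

Taking the expectation, we obtain that 

$$\mathbb{E}[e^{\lambda X}] \le \frac{b - \mathbb{E}[X]}{b - a}\,e^{\lambda a} + \frac{\mathbb{E}[X] - a}{b - a}\,e^{\lambda b} = \frac{b}{b - a}\,e^{\lambda a} - \frac{a}{b - a}\,e^{\lambda b},$$

where we used the fact that $\mathbb{E}[X] = 0$. Denote $h = \lambda(b - a)$, $p = \frac{-a}{b - a}$, and 
$L(h) = -hp + \log(1 - p + pe^h)$. Then, the expression on the right-hand side of 
the equation can be rewritten as $e^{L(h)}$. Therefore, to conclude our proof it suffices 
to show that $L(h) \le \frac{h^2}{8}$. This follows from Taylor’s theorem using the facts 
$L(0) = L'(0) = 0$ and $L''(h) \le 1/4$ for all $h$. □ 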


B.5 BENNET’S AND BERNSTEIN’S INEQUALITIES 


Bennet’s and Bernstein’s inequalities are similar to Chernoff’s bounds, but they 
hold for any sequence of independent random variables. We state the inequalities 
without proof, which can be found, for example, in Cesa-Bianchi and Lugosi (2006). 


Lemma B.8 (Bennet’s Inequality). Let $Z_1,\ldots,Z_m$ be independent random variables 
with zero mean, and assume that $Z_i \le 1$ with probability 1. Let 

$$\sigma^2 \ge \frac{1}{m}\sum_{i=1}^m \mathbb{E}[Z_i^2].$$

Then for all $\epsilon > 0$, 

$$\mathbb{P}\left[\sum_{i=1}^m Z_i > \epsilon\right] \le e^{-m\sigma^2\,h\left(\frac{\epsilon}{m\sigma^2}\right)},$$

where 

$$h(a) = (1+a)\log(1+a) - a.$$
By using the inequality $h(a) \ge a^2/(2 + 2a/3)$ it is possible to derive the following: 


Lemma B.9 (Bernstein’s Inequality). Let $Z_1,\ldots,Z_m$ be i.i.d. random variables with 
a zero mean. If for all $i$, $\mathbb{P}(|Z_i| < M) = 1$, then for all $t > 0$: 

$$\mathbb{P}\left[\sum_{i=1}^m Z_i > t\right] \le \exp\left(-\frac{t^2/2}{\sum_j \mathbb{E}Z_j^2 + Mt/3}\right).$$


B.5.1 Application 


Bernstein’s inequality can be used to interpolate between the rate $1/\epsilon$ we derived 
for PAC learning in the realizable case (in Chapter 2) and the rate $1/\epsilon^2$ we derived 
for the unrealizable case (in Chapter 4). 


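Lemma B.10. Let $\ell : \mathcal{H} \times Z \to [0,1]$ be a loss function. Let $\mathcal{D}$ be an arbitrary 
distribution over $Z$. Fix some $h$. Then, for any $\delta \in (0,1)$ we have 

1. $\mathbb{P}_{S\sim\mathcal{D}^m}\left[L_S(h) \ge L_\mathcal{D}(h) + \sqrt{\frac{2L_\mathcal{D}(h)\log(1/\delta)}{m}} + \frac{2\log(1/\delta)}{3m}\right] \le \delta$ 
2. $\mathbb{P}_{S\sim\mathcal{D}^m}\left[L_\mathcal{D}(h) \ge L_S(h) + \sqrt{\frac{2L_S(h)\log(1/\delta)}{m}} + \frac{4\log(1/\delta)}{m}\right] \le \delta$ 


Proof. Define random variables $\alpha_1,\ldots,\alpha_m$ s.t. $\alpha_i = \ell(h, z_i) - L_\mathcal{D}(h)$. Note that 
$\mathbb{E}[\alpha_i] = 0$ and that 

$$\begin{aligned}
\mathbb{E}[\alpha_i^2] &= \mathbb{E}[\ell(h, z_i)^2] - 2L_\mathcal{D}(h)\,\mathbb{E}[\ell(h, z_i)] + L_\mathcal{D}(h)^2 \\
&= \mathbb{E}[\ell(h, z_i)^2] - L_\mathcal{D}(h)^2 \\
&\le \mathbb{E}[\ell(h, z_i)^2] \\
&\le \mathbb{E}[\ell(h, z_i)] = L_\mathcal{D}(h),
\end{aligned}$$

where in the last inequality we used the fact that $\ell(h, z_i) \in [0,1]$ and thus $\ell(h, z_i)^2 \le 
\ell(h, z_i)$. Applying Bernstein’s inequality over the $\alpha_i$’s yields 

$$\mathbb{P}\left[\sum_{i=1}^m \alpha_i > t\right] \le \exp\left(-\frac{t^2/2}{\sum_j \mathbb{E}\alpha_j^2 + t/3}\right) \le \exp\left(-\frac{t^2/2}{mL_\mathcal{D}(h) + t/3}\right) \overset{\text{def}}{=} \delta.$$

Solving for $t$ yields 

$$\begin{aligned}
&\frac{t^2/2}{mL_\mathcal{D}(h) + t/3} = \log(1/\delta) \\
&\Rightarrow\ t^2/2 - \frac{\log(1/\delta)}{3}\,t - \log(1/\delta)\,mL_\mathcal{D}(h) = 0 \\
&\Rightarrow\ t = \frac{\log(1/\delta)}{3} + \sqrt{\frac{\log^2(1/\delta)}{3^2} + 2\log(1/\delta)\,mL_\mathcal{D}(h)} \\
&\qquad \le \frac{2\log(1/\delta)}{3} + \sqrt{2\log(1/\delta)\,mL_\mathcal{D}(h)}.
\end{aligned}$$

Since $\frac{1}{m}\sum_i \alpha_i = L_S(h) - L_\mathcal{D}(h)$, it follows that with probability of at least $1 - \delta$, 

$$L_S(h) - L_\mathcal{D}(h) \le \frac{2\log(1/\delta)}{3m} + \sqrt{\frac{2\log(1/\delta)\,L_\mathcal{D}(h)}{m}},$$

which proves the first inequality. The second part of the lemma follows in a 
similar way. □ 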


B.6 SLUD’S INEQUALITY 


Let $X$ be a $(m,p)$ binomial variable. That is, $X = \sum_{i=1}^m Z_i$, where each $Z_i$ is 1 with
probability $p$ and 0 with probability $1-p$. Assume that $p = (1-\epsilon)/2$. Slud’s inequality
(Slud 1977) tells us that $\mathbb{P}[X \ge m/2]$ is lower bounded by the probability that
a standard normal variable will be greater than or equal to $\sqrt{m\epsilon^2/(1-\epsilon^2)}$. The following
lemma follows by standard tail bounds for the normal distribution.


Lemma B.11. Let $X$ be a $(m,p)$ binomial variable and assume that $p = (1-\epsilon)/2$.
Then,

$$\mathbb{P}[X \ge m/2] \ge \frac{1}{2}\left(1 - \sqrt{1 - \exp\left(-m\epsilon^2/(1-\epsilon^2)\right)}\right).$$
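As a sanity check, the exact binomial tail can be compared with this lower bound for a few small cases. The sketch below is illustrative: the helper names and the $(m, \epsilon)$ pairs are assumptions, and the bound is taken in the form $\frac{1}{2}\bigl(1 - \sqrt{1 - \exp(-m\epsilon^2/(1-\epsilon^2))}\bigr)$.

```python
import math

def binomial_upper_tail(m: int, p: float) -> float:
    """Exact P[X >= m/2] for X ~ Binomial(m, p)."""
    start = math.ceil(m / 2)
    return sum(math.comb(m, j) * p**j * (1.0 - p) ** (m - j)
               for j in range(start, m + 1))

def slud_lower_bound(m: int, eps: float) -> float:
    """Lower bound on P[X >= m/2] when p = (1 - eps) / 2."""
    return 0.5 * (1.0 - math.sqrt(1.0 - math.exp(-m * eps**2 / (1.0 - eps**2))))

# Illustrative (m, eps) pairs, not taken from the text.
for m, eps in [(20, 0.2), (100, 0.1), (200, 0.05)]:
    p = (1.0 - eps) / 2.0
    print(m, eps, binomial_upper_tail(m, p), slud_lower_bound(m, eps))
```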


B.7 CONCENTRATION OF $\chi^2$ VARIABLES


Let $X_1,\ldots,X_k$ be $k$ independent normally distributed random variables. That is,
for all $i$, $X_i \sim N(0,1)$. The distribution of the random variable $X_i^2$ is called $\chi^2$ (chi
square) and the distribution of the random variable $Z = X_1^2+\cdots+X_k^2$ is called $\chi^2_k$ (chi
square with $k$ degrees of freedom). Clearly, $\mathbb{E}[X_i^2] = 1$ and $\mathbb{E}[Z] = k$. The following
lemma states that $\chi^2_k$ is concentrated around its mean.


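Lemma B.12. Let $Z \sim \chi^2_k$. Then, for all $\epsilon > 0$ we have

$$\mathbb{P}[Z \le (1-\epsilon)k] \le e^{-\epsilon^2 k/6},$$

and for all $\epsilon \in (0,3)$ we have

$$\mathbb{P}[Z \ge (1+\epsilon)k] \le e^{-\epsilon^2 k/6}.$$

Finally, for all $\epsilon \in (0,3)$,

$$\mathbb{P}[(1-\epsilon)k \le Z \le (1+\epsilon)k] \ge 1 - 2e^{-\epsilon^2 k/6}.$$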


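Proof. Let us write $Z = \sum_{i=1}^k X_i^2$ where $X_i \sim N(0,1)$. To prove both bounds we
use Chernoff’s bounding method. For the first inequality, we first bound $\mathbb{E}[e^{-\lambda X_1^2}]$,
where $\lambda > 0$ will be specified later. Since $e^{-a} \le 1 - a + \frac{a^2}{2}$ for all $a \ge 0$ we have that

$$\mathbb{E}[e^{-\lambda X_1^2}] \le 1 - \lambda\,\mathbb{E}[X_1^2] + \frac{\lambda^2}{2}\,\mathbb{E}[X_1^4].$$

Using the well known equalities, $\mathbb{E}[X_1^2] = 1$ and $\mathbb{E}[X_1^4] = 3$, and the fact that $1 - a \le
e^{-a}$ we obtain that

$$\mathbb{E}[e^{-\lambda X_1^2}] \le 1 - \lambda + \tfrac{3}{2}\lambda^2 \le e^{-\lambda + \frac{3}{2}\lambda^2}.$$

Now, applying Chernoff’s bounding method we get that

$$\begin{aligned}
\mathbb{P}[-Z \ge -(1-\epsilon)k] &= \mathbb{P}\left[e^{-\lambda Z} \ge e^{-(1-\epsilon)k\lambda}\right] \\
&\le e^{(1-\epsilon)k\lambda}\,\mathbb{E}[e^{-\lambda Z}] \\
&= e^{(1-\epsilon)k\lambda}\left(\mathbb{E}[e^{-\lambda X_1^2}]\right)^k \\
&\le e^{(1-\epsilon)k\lambda}\, e^{-\lambda k + \frac{3}{2}\lambda^2 k} \\
&= e^{-\epsilon k\lambda + \frac{3}{2}k\lambda^2}.
\end{aligned}$$

Choosing $\lambda = \epsilon/3$ we obtain the first inequality stated in the lemma.
For the second inequality, we use a known closed form expression for the
moment generating function of a $\chi^2_k$ distributed random variable:

$$\forall \lambda < \tfrac{1}{2}, \quad \mathbb{E}[e^{\lambda Z}] = (1-2\lambda)^{-k/2}. \qquad \text{(B.7)}$$

On the basis of the equation and using Chernoff’s bounding method we have

$$\begin{aligned}
\mathbb{P}[Z \ge (1+\epsilon)k] &= \mathbb{P}\left[e^{\lambda Z} \ge e^{(1+\epsilon)k\lambda}\right] \\
&\le e^{-(1+\epsilon)k\lambda}\,\mathbb{E}[e^{\lambda Z}] \\
&= e^{-(1+\epsilon)k\lambda}\,(1-2\lambda)^{-k/2} \\
&\le e^{-(1+\epsilon)k\lambda}\, e^{k\lambda} = e^{-\epsilon k\lambda},
\end{aligned}$$

where the last inequality occurs because $(1-a) \le e^{-a}$. Setting $\lambda = \epsilon/6$ (which is in
$(0,1/2)$ by our assumption) we obtain the second inequality stated in the lemma.
Finally, the last inequality follows from the first two inequalities and the union
bound. □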
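The two ingredients of the lower-tail argument can be checked numerically. The sketch below is illustrative only; the function names and grid values are assumptions. It uses the closed form $\mathbb{E}[e^{-\lambda X^2}] = (1+2\lambda)^{-1/2}$ for $X \sim N(0,1)$ and $\lambda > 0$, verifies the moment bound $e^{-\lambda + \frac{3}{2}\lambda^2}$ on a small grid, and confirms that $\lambda = \epsilon/3$ gives the exponent $-\epsilon^2 k/6$.

```python
import math

def mgf_neg(lam: float) -> float:
    """E[exp(-lam * X^2)] for X ~ N(0, 1), which equals (1 + 2 lam)^(-1/2)."""
    return (1.0 + 2.0 * lam) ** -0.5

def chernoff_exponent(eps: float, k: int, lam: float) -> float:
    """Exponent -eps*k*lam + (3/2)*k*lam^2 from the lower-tail derivation."""
    return -eps * k * lam + 1.5 * k * lam * lam

# The moment bound used in the proof, checked on a grid of lam values.
for lam in (0.01, 0.1, 0.5, 1.0, 5.0):
    assert mgf_neg(lam) <= math.exp(-lam + 1.5 * lam * lam)

# lam = eps/3 minimizes the quadratic exponent and yields -eps^2 * k / 6.
eps, k = 0.3, 50  # illustrative values
print(chernoff_exponent(eps, k, eps / 3.0), -eps**2 * k / 6.0)
```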
Linear Algebra


C.1 BASIC DEFINITIONS 


In this chapter we only deal with linear algebra over finite dimensional Euclidean 
spaces. We refer to vectors as column vectors. 
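Given two $d$ dimensional vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$, their inner product is

$$\langle \mathbf{u}, \mathbf{v}\rangle = \sum_{i=1}^d u_i v_i.$$

The Euclidean norm (a.k.a. the $\ell_2$ norm) is $\|\mathbf{u}\| = \sqrt{\langle \mathbf{u}, \mathbf{u}\rangle}$. We also use the $\ell_1$ norm,
$\|\mathbf{u}\|_1 = \sum_{i=1}^d |u_i|$, and the $\ell_\infty$ norm, $\|\mathbf{u}\|_\infty = \max_i |u_i|$.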
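For concreteness, these quantities can be computed directly. The following is a minimal sketch in plain Python, taking the inner product to be the usual sum of coordinate-wise products; the function names and the example vector are illustrative choices.

```python
import math
from typing import Sequence

def inner(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product: sum of coordinate-wise products."""
    return sum(ui * vi for ui, vi in zip(u, v))

def norm2(u: Sequence[float]) -> float:
    """Euclidean (l2) norm."""
    return math.sqrt(inner(u, u))

def norm1(u: Sequence[float]) -> float:
    """l1 norm: sum of absolute values."""
    return sum(abs(x) for x in u)

def norm_inf(u: Sequence[float]) -> float:
    """l-infinity norm: largest absolute coordinate."""
    return max(abs(x) for x in u)

u = [3.0, -4.0, 0.0]
print(norm2(u), norm1(u), norm_inf(u))  # 5.0 7.0 4.0
```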

A subspace of $\mathbb{R}^d$ is a subset of $\mathbb{R}^d$ which is closed under addition and scalar
multiplication. The span of a set of vectors $\mathbf{u}_1,\ldots,\mathbf{u}_k$ is the subspace containing all
vectors of the form

$$\sum_{i=1}^k \alpha_i \mathbf{u}_i$$

where for all $i$, $\alpha_i \in \mathbb{R}$.

A set of vectors $U = \{\mathbf{u}_1,\ldots,\mathbf{u}_k\}$ is independent if for every $i$, $\mathbf{u}_i$ is not in the span
of $\mathbf{u}_1,\ldots,\mathbf{u}_{i-1},\mathbf{u}_{i+1},\ldots,\mathbf{u}_k$. We say that $U$ spans a subspace $V$ if $V$ is the span of the
vectors in $U$. We say that $U$ is a basis of $V$ if it is both independent and spans $V$. The
dimension of $V$ is the size of a basis of $V$ (and it can be verified that all bases of $V$
have the same size). We say that $U$ is an orthogonal set if for all $i \ne j$, $\langle \mathbf{u}_i, \mathbf{u}_j\rangle = 0$.
We say that $U$ is an orthonormal set if it is orthogonal and if for every $i$, $\|\mathbf{u}_i\| = 1$.

Given a matrix $A \in \mathbb{R}^{n,d}$, the range of $A$ is the span of its columns and the null
space of $A$ is the subspace of all vectors that satisfy $A\mathbf{u} = \mathbf{0}$. The rank of $A$ is the
dimension of its range.

The transpose of a matrix $A$, denoted $A^\top$, is the matrix whose $(i,j)$ entry equals
the $(j,i)$ entry of $A$. We say that $A$ is symmetric if $A = A^\top$.


C.2 EIGENVALUES AND EIGENVECTORS 


Let $A \in \mathbb{R}^{d,d}$ be a matrix. A nonzero vector $\mathbf{u}$ is an eigenvector of $A$ with a
corresponding eigenvalue $\lambda$ if

$$A\mathbf{u} = \lambda\mathbf{u}.$$


Theorem C.1 (Spectral Decomposition). If $A \in \mathbb{R}^{d,d}$ is a symmetric matrix of rank
$k$, then there exists an orthonormal basis of $\mathbb{R}^d$, $\mathbf{u}_1,\ldots,\mathbf{u}_d$, such that each $\mathbf{u}_i$ is an
eigenvector of $A$. Furthermore, $A$ can be written as $A = \sum_{i=1}^d \lambda_i \mathbf{u}_i\mathbf{u}_i^\top$, where each
$\lambda_i$ is the eigenvalue corresponding to the eigenvector $\mathbf{u}_i$. This can be written equivalently
as $A = UDU^\top$, where the columns of $U$ are the vectors $\mathbf{u}_1,\ldots,\mathbf{u}_d$, and $D$ is
a diagonal matrix with $D_{i,i} = \lambda_i$ and for $i \ne j$, $D_{i,j} = 0$. Finally, the number of $\lambda_i$
which are nonzero is the rank of the matrix, the eigenvectors which correspond to the
nonzero eigenvalues span the range of $A$, and the eigenvectors which correspond to
zero eigenvalues span the null space of $A$.
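A small worked instance may help. The sketch below uses a hand-picked symmetric $2\times 2$ matrix of rank 1 whose eigenpairs were worked out by hand (the matrix, the helper `matvec`, and the tolerance are illustrative assumptions). It checks $A\mathbf{u}_i = \lambda_i\mathbf{u}_i$, rebuilds $A$ as $\sum_i \lambda_i \mathbf{u}_i\mathbf{u}_i^\top$, and counts the nonzero eigenvalues to recover the rank.

```python
import math

A = [[1.0, 1.0],
     [1.0, 1.0]]                      # symmetric, rank 1

s = 1.0 / math.sqrt(2.0)
eigvals = [2.0, 0.0]
eigvecs = [[s, s], [s, -s]]           # orthonormal basis of R^2

def matvec(M, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

# Check A u_i = lambda_i u_i for each eigenpair.
for lam, u in zip(eigvals, eigvecs):
    Au = matvec(A, u)
    assert all(abs(Au[i] - lam * u[i]) < 1e-12 for i in range(2))

# Rebuild A as sum_i lambda_i u_i u_i^T.
rebuilt = [[sum(lam * u[i] * u[j] for lam, u in zip(eigvals, eigvecs))
            for j in range(2)] for i in range(2)]
rank = sum(1 for lam in eigvals if abs(lam) > 1e-12)
print(rebuilt, rank)
```

Here the eigenvector with eigenvalue zero, $(1,-1)/\sqrt{2}$, spans the null space, and the other eigenvector spans the range.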


C.3 POSITIVE DEFINITE MATRICES 


A symmetric matrix $A \in \mathbb{R}^{d,d}$ is positive definite if all its eigenvalues are positive. $A$
is positive semidefinite if all its eigenvalues are nonnegative.


Theorem C.2. Let $A \in \mathbb{R}^{d,d}$ be a symmetric matrix. Then, the following are equivalent
definitions of positive semidefiniteness of $A$:

• All the eigenvalues of $A$ are nonnegative.
• For every vector $\mathbf{u}$, $\langle \mathbf{u}, A\mathbf{u}\rangle \ge 0$.
• There exists a matrix $B$ such that $A = BB^\top$.
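The link between the factorization and the quadratic form can be seen numerically: if $A = BB^\top$ then $\langle \mathbf{u}, A\mathbf{u}\rangle = \|B^\top\mathbf{u}\|^2 \ge 0$. The sketch below is illustrative only; the random matrix $B$, the seed, and the helper names are assumptions.

```python
import random

def gram(B):
    """Return A = B B^T for a matrix B given as a list of rows."""
    n, k = len(B), len(B[0])
    return [[sum(B[i][c] * B[j][c] for c in range(k)) for j in range(n)]
            for i in range(n)]

def quad_form(A, u):
    """Return <u, A u>."""
    n = len(u)
    return sum(u[i] * A[i][j] * u[j] for i in range(n) for j in range(n))

rng = random.Random(0)                                    # fixed seed for repeatability
B = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]   # 3 x 2
A = gram(B)                                               # 3 x 3, positive semidefinite
samples = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(1000)]
print(min(quad_form(A, u) for u in samples))              # never meaningfully below 0
```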


C.4 SINGULAR VALUE DECOMPOSITION (SVD) 


Let $A \in \mathbb{R}^{m,n}$ be a matrix of rank $r$. When $m \ne n$, the eigenvalue decomposition given
in Theorem C.1 cannot be applied. We will describe another decomposition of $A$,
which is called Singular Value Decomposition, or SVD for short.

Unit vectors $\mathbf{v} \in \mathbb{R}^n$ and $\mathbf{u} \in \mathbb{R}^m$ are called right and left singular vectors of $A$ with
corresponding singular value $\sigma > 0$ if

$$A\mathbf{v} = \sigma\mathbf{u} \quad\text{and}\quad A^\top\mathbf{u} = \sigma\mathbf{v}.$$

We first show that if we can find $r$ orthonormal singular vectors with positive
singular values, then we can decompose $A = UDV^\top$, with the columns of $U$ and $V$
containing the left and right singular vectors, and $D$ being a diagonal $r \times r$ matrix
with the singular values on its diagonal.


Lemma C.3. Let $A \in \mathbb{R}^{m,n}$ be a matrix of rank $r$. Assume that $\mathbf{v}_1,\ldots,\mathbf{v}_r$ is an
orthonormal set of right singular vectors of $A$, $\mathbf{u}_1,\ldots,\mathbf{u}_r$ is an orthonormal set of
corresponding left singular vectors of $A$, and $\sigma_1,\ldots,\sigma_r$ are the corresponding singular
values. Then,

$$A = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^\top.$$

It follows that if $U$ is a matrix whose columns are the $\mathbf{u}_i$’s, $V$ is a matrix whose columns
are the $\mathbf{v}_i$’s, and $D$ is a diagonal matrix with $D_{i,i} = \sigma_i$, then

$$A = UDV^\top.$$


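Proof. Any right singular vector of $A$ must be in the range of $A^\top$ (otherwise, the
singular value will have to be zero). Therefore, $\mathbf{v}_1,\ldots,\mathbf{v}_r$ is an orthonormal basis
of the range of $A^\top$. Let us complete it to an orthonormal basis of $\mathbb{R}^n$ by adding
the vectors $\mathbf{v}_{r+1},\ldots,\mathbf{v}_n$. Define $B = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^\top$. It suffices to prove that for all $i$,
$A\mathbf{v}_i = B\mathbf{v}_i$. Clearly, if $i > r$ then $A\mathbf{v}_i = \mathbf{0}$ and $B\mathbf{v}_i = \mathbf{0}$ as well. For $i \le r$
we have

$$B\mathbf{v}_i = \sum_{j=1}^r \sigma_j \mathbf{u}_j \mathbf{v}_j^\top \mathbf{v}_i = \sigma_i \mathbf{u}_i = A\mathbf{v}_i,$$

where the last equality follows from the definition. □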


The next lemma relates the singular values of $A$ to the eigenvalues of $A^\top A$
and $AA^\top$.


Lemma C.4. $\mathbf{v}, \mathbf{u}$ are right and left singular vectors of $A$ with singular value $\sigma$ iff
$\mathbf{v}$ is an eigenvector of $A^\top A$ with corresponding eigenvalue $\sigma^2$ and $\mathbf{u} = \sigma^{-1}A\mathbf{v}$ is an
eigenvector of $AA^\top$ with corresponding eigenvalue $\sigma^2$.

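Proof. Suppose that $\sigma$ is a singular value of $A$ with $\mathbf{v} \in \mathbb{R}^n$ being the corresponding
right singular vector. Then,

$$A^\top A\mathbf{v} = \sigma A^\top\mathbf{u} = \sigma^2\mathbf{v}.$$

Similarly,

$$AA^\top\mathbf{u} = \sigma A\mathbf{v} = \sigma^2\mathbf{u}.$$

For the other direction, if $\lambda \ne 0$ is an eigenvalue of $A^\top A$, with $\mathbf{v}$ being the
corresponding eigenvector, then $\lambda > 0$ because $A^\top A$ is positive semidefinite. Let
$\sigma = \sqrt{\lambda}$, $\mathbf{u} = \sigma^{-1}A\mathbf{v}$. Then,

$$\sigma\mathbf{u} = \sqrt{\lambda}\,\frac{A\mathbf{v}}{\sqrt{\lambda}} = A\mathbf{v},$$

and

$$A^\top\mathbf{u} = \frac{1}{\sigma}A^\top A\mathbf{v} = \frac{\lambda}{\sigma}\mathbf{v} = \sigma\mathbf{v}.$$

□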
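The correspondence between eigenvectors of $A^\top A$ and singular vectors of $A$ can be verified on a small non-square example. The sketch below is illustrative: the $3 \times 2$ matrix, the hand-computed eigenpairs of $A^\top A$, and the helper names are assumptions. For each eigenpair $(\lambda, \mathbf{v})$ it forms $\sigma = \sqrt{\lambda}$ and $\mathbf{u} = \sigma^{-1}A\mathbf{v}$, then computes $A^\top\mathbf{u}$ and $AA^\top\mathbf{u}$, which should equal $\sigma\mathbf{v}$ and $\sigma^2\mathbf{u}$.

```python
import math

A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]                       # m = 3, n = 2

def matvec(M, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

At = transpose(A)
s = 1.0 / math.sqrt(2.0)
# Eigenpairs of A^T A = [[2, 1], [1, 2]], worked out by hand.
pairs = [(3.0, [s, s]), (1.0, [s, -s])]

for lam, v in pairs:
    sigma = math.sqrt(lam)
    u = [x / sigma for x in matvec(A, v)]      # u = sigma^{-1} A v
    Atu = matvec(At, u)                        # should equal sigma * v
    AAtu = matvec(A, Atu)                      # should equal sigma^2 * u
    print(sigma, u, Atu, AAtu)
```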


Finally, we show that if A has rank r then it has r orthonormal singular vectors. 
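Lemma C.5. Let $A \in \mathbb{R}^{m,n}$ with rank $r$. Define the following vectors:

$$\begin{aligned}
\mathbf{v}_1 &= \operatorname*{argmax}_{\mathbf{v}\in\mathbb{R}^n:\,\|\mathbf{v}\|=1} \|A\mathbf{v}\| \\
\mathbf{v}_2 &= \operatorname*{argmax}_{\substack{\mathbf{v}\in\mathbb{R}^n:\,\|\mathbf{v}\|=1 \\ \langle \mathbf{v},\mathbf{v}_1\rangle = 0}} \|A\mathbf{v}\| \\
&\;\;\vdots \\
\mathbf{v}_r &= \operatorname*{argmax}_{\substack{\mathbf{v}\in\mathbb{R}^n:\,\|\mathbf{v}\|=1 \\ \forall i<r,\ \langle \mathbf{v},\mathbf{v}_i\rangle = 0}} \|A\mathbf{v}\|
\end{aligned}$$

Then, $\mathbf{v}_1,\ldots,\mathbf{v}_r$ is an orthonormal set of right singular vectors of $A$.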


Proof. First note that since the rank of $A$ is $r$, the range of $A$ is a subspace of
dimension $r$, and therefore it is easy to verify that for all $i = 1,\ldots,r$, $\|A\mathbf{v}_i\| > 0$.
Let $W \in \mathbb{R}^{n,n}$ be an orthonormal matrix obtained by the eigenvalue decomposition
of $A^\top A$, namely, $A^\top A = WDW^\top$, with $D$ being a diagonal matrix with
$D_{1,1} \ge D_{2,2} \ge \cdots \ge 0$. We will show that $\mathbf{v}_1,\ldots,\mathbf{v}_r$ are eigenvectors of $A^\top A$ that
correspond to nonzero eigenvalues, and, hence, using Lemma C.4 it follows that
these are also right singular vectors of $A$. The proof is by induction. For the basis of
the induction, note that any unit vector $\mathbf{v}$ can be written as $\mathbf{v} = W\mathbf{x}$, for $\mathbf{x} = W^\top\mathbf{v}$,
and note that $\|\mathbf{x}\| = 1$. Therefore,

$$\|A\mathbf{v}\|^2 = \mathbf{v}^\top A^\top A\mathbf{v} = \mathbf{x}^\top W^\top W D W^\top W\mathbf{x} = \mathbf{x}^\top D\mathbf{x} = \sum_{i=1}^n D_{i,i}\,x_i^2.$$

Therefore,

$$\max_{\mathbf{v}:\|\mathbf{v}\|=1} \|A\mathbf{v}\|^2 = \max_{\mathbf{x}:\|\mathbf{x}\|=1} \sum_{i=1}^n D_{i,i}\,x_i^2.$$

The solution of the right-hand side is to set $\mathbf{x} = (1,0,\ldots,0)$, which implies that $\mathbf{v}_1$ is
the first eigenvector of $A^\top A$. Since $\|A\mathbf{v}_1\| > 0$ it follows that $D_{1,1} > 0$ as required.
For the induction step, assume that the claim holds for some $1 \le t \le r-1$. Then,
any $\mathbf{v}$ which is orthogonal to $\mathbf{v}_1,\ldots,\mathbf{v}_t$ can be written as $\mathbf{v} = W\mathbf{x}$ with all the first $t$
elements of $\mathbf{x}$ being zero. It follows that

$$\max_{\mathbf{v}:\|\mathbf{v}\|=1,\ \forall i\le t,\ \mathbf{v}^\top\mathbf{v}_i=0} \|A\mathbf{v}\|^2 = \max_{\mathbf{x}:\|\mathbf{x}\|=1} \sum_{i=t+1}^n D_{i,i}\,x_i^2.$$

The solution of the right-hand side is the all zeros vector except $x_{t+1} = 1$. This
implies that $\mathbf{v}_{t+1}$ is the $(t+1)$th column of $W$. Finally, since $\|A\mathbf{v}_{t+1}\| > 0$ it follows
that $D_{t+1,t+1} > 0$ as required. This concludes our proof. □
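The greedy definition in Lemma C.5 can be imitated numerically. The lemma itself only defines each vector through an argmax; the sketch below approximates each argmax with power iteration on $A^\top A$ restricted to the orthogonal complement of the vectors already found, which is a substitute technique and not a method described in the text. The function name, start vectors, iteration count, and example matrix are all illustrative assumptions, and the routine assumes $r$ does not exceed the rank of $A$.

```python
import math

def matvec(M, x):
    return [sum(row[j] * x[j] for j in range(len(x))) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def top_right_singular_vectors(A, r, iters=500):
    """Approximate v_1, ..., v_r of Lemma C.5 by power iteration on A^T A."""
    At, n = transpose(A), len(A[0])
    vs = []
    for k in range(r):
        # Deterministic start vector (an arbitrary choice for this sketch).
        v = [1.0 / (j + 1 + k) for j in range(n)]
        for _ in range(iters):
            # Project out previously found directions, then apply A^T A.
            for w in vs:
                c = dot(v, w)
                v = [a - c * b for a, b in zip(v, w)]
            v = matvec(At, matvec(A, v))
            norm = math.sqrt(dot(v, v))
            v = [a / norm for a in v]
        vs.append(v)
    return vs

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # rank 2
vs = top_right_singular_vectors(A, r=2)
sigmas = [math.sqrt(dot(matvec(A, v), matvec(A, v))) for v in vs]
print(sigmas)   # approximately sqrt(3) and 1 for this matrix
```

Each $\sigma_i = \|A\mathbf{v}_i\|$ is the corresponding singular value, and setting $\mathbf{u}_i = \sigma_i^{-1}A\mathbf{v}_i$ gives the matching left singular vector.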


Corollary C.6 (The SVD Theorem). Let $A \in \mathbb{R}^{m,n}$ with rank $r$. Then $A = UDV^\top$
where $D$ is an $r \times r$ matrix with nonzero singular values of $A$ and the columns of $U, V$
are orthonormal left and right singular vectors of $A$. Furthermore, for all $i$, $D_{i,i}^2$ is
an eigenvalue of $A^\top A$, the $i$th column of $V$ is the corresponding eigenvector of $A^\top A$
and the $i$th column of $U$ is the corresponding eigenvector of $AA^\top$.
predictor, 14
prefix free language, 64 
Principal Component Analysis, see PCA 
prior knowledge, 39 
Probably Approximately Correct, see PAC 
projection, 159 
projection lemma, 159 
proper, 28 
pruning, 216 


Rademacher complexity, 325 
random forests, 217 
random projections, 283 
ranking, 201 

bipartite, 206 
realizability, 17 
recall, 206 
regression, 26, 94, 138 
regularization, 137 

Tikhonov, 138, 140 
regularized loss minimization, see RLM 
representation independent, 28, 80 
representative sample, 31, 325 
representer theorem, 182 
ridge regression, 138
kernel ridge regression, 188 
RIP, 286 
risk, 14, 24, 26 
RLM, 137, 164 


sample complexity, 22 
Sauer’s lemma, 49 
self-boundedness, 130 
sensitivity, 206 
SGD, 156 
shattering, 45, 352 
single linkage, 267 
Singular Value Decomposition, see SVD 
Slud’s inequality, 378 
smoothness, 129, 143, 163 
SOA, 250 
sparsity-inducing norms, 315 
specificity, 206 
spectral clustering, 271 
SRM, 60, 115 
stability, 139 
Stochastic Gradient Descent, see SGD 
strong learning, 102 
Structural Risk Minimization, see SRM 
structured output prediction, 198 
subgradient, 154 
Support Vector Machines, see SVM 
SVD, 381 
SVM, 167, 333 
duality, 175 
generalization bounds, 172, 333 
hard-SVM, 168, 169 
homogenous, 170 
kernel trick, 181 
soft-SVM, 171 
support vectors, 175 


target set, 26 

term frequency, 194 
TF-IDF, 194 
training error, 15 
training set, 13 

true error, 14, 24 


underfitting, 41, 121 
uniform convergence, 31, 32 
union bound, 19 
unsupervised learning, 265 


validation, 114, 116 
cross validation, 119 
train-validation-test split, 120 
Vapnik-Chervonenkis dimension, see VC dimension
VC dimension, 43, 46 
version space, 247 
Viola-Jones, 110 


weak learning, 101, 102 
Weighted-Majority, 252 